[antlr-interest] Newbie question about lex token matching priority

Rob Greene robgreene at gmail.com
Mon Feb 2 05:25:55 PST 2009


Ok, how about this?  The problem I had was that I'd defined '(' and ')'
as a lexer rule and then tried to use them as such in the grammar.  I had
to break them out --

/* Parsing input such as "(#)name" */
grammar CurtCarpenter;

language:	data+ ;
data	:	LPAREN NUMBER RPAREN name NEWLINE ;
name	:	(CHAR | LPAREN | RPAREN)+ ;

NUMBER	:	'-'? '0'..'9'+ ;
CHAR 	:	' ' | '!'..'\'' | '*'..'\u00fe' ;
NEWLINE	:	'\r'? '\n';
LPAREN	:	'(';
RPAREN	:	')';

Is that close to what you need?  It seemed to work in ANTLRWorks and I
checked it against this input:

(1)apple
(2)george
(3)bad (one)
(-5) lots of spaces in here! ((()))
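
For anyone who wants to sanity-check the token stream outside ANTLRWorks, here is a rough Python sketch (plain Python, not ANTLR output) that mimics the five lexer rules above and the `data` rule. It assumes the lexer takes the longest match and falls back to rule order on ties, which is a simplification of what ANTLR really does. The `tokenize` and `matches_data` helpers are made-up names for this illustration only.

```python
import re

# The five lexer rules from the grammar above, in the same order.
# Assumption: longest match wins, and rule order breaks ties.
TOKEN_RULES = [
    ("NUMBER", re.compile(r"-?[0-9]+")),
    ("CHAR", re.compile(r"[ !-'*-\u00fe]")),
    ("NEWLINE", re.compile(r"\r?\n")),
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
]

# data : LPAREN NUMBER RPAREN name NEWLINE ;
# name : (CHAR | LPAREN | RPAREN)+ ;
DATA_RULE = re.compile(r"LPAREN NUMBER RPAREN( (?:CHAR|LPAREN|RPAREN))+ NEWLINE")


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, regex in TOKEN_RULES:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earliest rule wins a tie.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no lexer rule matches {text[pos]!r} at offset {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


def matches_data(tokens):
    """Check whether the token types form exactly one `data` line."""
    types = " ".join(name for name, _ in tokens)
    return DATA_RULE.fullmatch(types) is not None


if __name__ == "__main__":
    for line in ["(1)apple\n", "(3)bad (one)\n", "(4)route 66\n"]:
        print(repr(line), matches_data(tokenize(line)))
```

One thing the sketch suggests, under the assumptions above: digits inside a name come out as NUMBER rather than CHAR, so a line like `(4)route 66` would not satisfy `name` as written. That has not been checked against ANTLR itself, and none of the four test lines above exercise it.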

-Rob

On Sun, Feb 1, 2009 at 9:13 AM, Curt Carpenter
<Curt.Carpenter at microsoft.com> wrote:
> Hi Rob, thanks for the reply. I was beginning to wonder whether my email went off into the ether.
>
> The problem with making PARENNUM a parse rule is that it ALSO matches NAME. In fact it would ONLY match NAME, and never the parse rule, since (0) is also a valid NAME (a lex token). So what I finally came up with (not sure this is exactly what I said in my follow-up) is a bunch of different tokens, matching the combinations that a name can appear in:
> ID_NAME_COLON : PARENNUM ('!'..'\u00FE')+ ':';   // (0)curtc:
> ID_NAME       : PARENNUM ('!'..'\u00FE')+ ;      // (0)curtc
> NAME_COLON    : ('!'..'\u00FE')+ ':';            // curtc:
> PARENNUM      : '(' NUMBER ')';                  // (0)
> NUMBER        : '-'? ('0'..'9')+;                // 0
> fragment NAME : ('!'..'\u00FE')+;                // curtc
> WS            : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
>
> Then in the parse rules, whichever token matches tells me which portion is the name and whether there is an id: each token gets its own target-language action, and cracking the text apart is trivial. Now that I think I've finally wrapped my head around it, I'm pretty sure what I have is reasonable (it passes all 250,000+ existing source files), given that I can't change the input. But if you think there's a much better way, I'd still be interested to hear it, even though I'm not blocked.
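
The "cracking" step described above might look roughly like this, sketched in Python. The thread does not show the actual actions or say which target language they are written in, so the `crack` helper and its return shape are purely hypothetical; only the token type names and the example inputs come from the lexer rules quoted above.

```python
import re

# Matches the PARENNUM prefix, e.g. "(0)" or "(-5)", capturing the number.
_PAREN_NUM = re.compile(r"\((-?[0-9]+)\)")


def crack(token_type, text):
    """Split one composite token into (id, name).

    Hypothetical helper: id is None for tokens that carry no PARENNUM.
    """
    ident = None
    if token_type in ("ID_NAME_COLON", "ID_NAME"):
        m = _PAREN_NUM.match(text)
        if m is None:
            raise ValueError(f"{token_type} text does not start with (number): {text!r}")
        ident = int(m.group(1))
        text = text[m.end():]  # everything after the closing ')'
    elif token_type != "NAME_COLON":
        raise ValueError(f"unexpected token type: {token_type}")
    if token_type in ("ID_NAME_COLON", "NAME_COLON"):
        if not text.endswith(":"):
            raise ValueError(f"{token_type} text does not end with ':': {text!r}")
        text = text[:-1]  # drop the trailing ':'
    return ident, text
```

For example, `crack("ID_NAME_COLON", "(0)curtc:")` and `crack("ID_NAME", "(0)curtc")` both yield the id 0 and the name `curtc`, while `crack("NAME_COLON", "curtc:")` yields no id and the same name.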
>
> Thanks,
>
> Curt
>
> -----Original Message-----
> From: Rob Greene
> Sent: Saturday, January 31, 2009 10:50 AM
>
> Hi Curt --
>
> First off, I'm a total hack at ANTLR.  And that's in the golfing sense
> with grass flying everywhere and not in the geek sense of being
> totally awesome.  Awesome is reserved for others on this list.  So, if
> I mislead you, sorry in advance!!
>
> Anyway, from your description, I think PARENNUM shouldn't be a lexer rule
> but a parser rule.
>
> So you'd have something like:
>
> language: data+ ;
> data: '(' NUMBER ')' NAME ;
>
> and the same lex rules for NUMBER and NAME as before.  Also, don't
> forget a whitespace rule (I assume you need one?)
>
> WS: (' ' | '\t' | '\n' | '\r')+ {skip();} ;
>
> I hope I'm leading you in the right direction at least!
> -Rob
>
> On Thu, Jan 29, 2009 at 10:14 AM, Curt Carpenter
> <Curt.Carpenter at microsoft.com> wrote:
>> Hi all, I am 1 day in on ANTLR, so be gentle. :)
>>
>> I have gone through the tutorials and such, and have created a grammar from
>> scratch, debugged it and have it mostly working, except for one problem. I
>> want to parse something like this:
>>
>> (#)name
>>
>> Where # is a number, but name can be virtually anything except space. I
>> think. I don't own the language, so please don't suggest that name should be
>> further restricted. So I defined the lex rules like so:
>>
>> PARENNUM : '(' NUMBER ')';
>> NUMBER   : '-'? ('0'..'9')+;
>> NAME     : ('!'..'\u00FE')+; // ansi only
>>
>> You can see the problem at NAME. (0)curt is a valid name. But what I really
>> want is to parse as PARENNUM=(0) NAME=curt. I have a parse rule to match
>> that. But, the lex rules match longest string first, so (0)curt is always
>> tokenized as NAME. Is there any way to change the priority of matching lex
>> tokens to be the order they're defined, rather than order only breaking ties
>> in string length?
>>
>> Or is there some other way to accomplish the simple parse rule I'm trying to
>> solve?
>>
>> Thanks,
>>
>> Curt
>

