[antlr-interest] Newbie question about lex token matching priority
Rob Greene
robgreene at gmail.com
Mon Feb 2 05:25:55 PST 2009
Ok, how about this? The problem I had was that I'd define the '(' and ')'
as a lexer rule and then try to use it as such in the grammar. I had
to break it out into separate LPAREN and RPAREN tokens:
/* Parsing input such as "(#)name" */
grammar CurtCarpenter;
language: data+ ;
data : LPAREN NUMBER RPAREN name NEWLINE ;
name : (CHAR | LPAREN | RPAREN)+ ;
NUMBER : '-'? '0'..'9'+ ;
CHAR : ' ' | '!'..'\'' | '*'..'\u00fe' ;
NEWLINE : '\r'? '\n';
LPAREN : '(';
RPAREN : ')';
Is that close to what you need? It seemed to work in ANTLRWorks and I
checked it against this input:
(1)apple
(2)george
(3)bad (one)
(-5) lots of spaces in here! ((()))
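
For anyone who wants to check the same input without ANTLRWorks, here is a rough Python sketch of the `data` rule as a regular expression. It is only an approximation of the grammar, not generated from it: the name `parse_line` is made up for this example, and the regex is looser in one respect, since it accepts digits inside the name, whereas the grammar as posted may tokenize those as NUMBER.

```python
import re

# Rough regex counterpart of: data : LPAREN NUMBER RPAREN name NEWLINE ;
# NUMBER is '-'? '0'..'9'+ ; name is one or more characters from ' ' to '\u00fe'.
DATA_LINE = re.compile(r"\((-?[0-9]+)\)([\x20-\xfe]+)")


def parse_line(line):
    """Return (number, name) for a line shaped like "(#)name", else None."""
    match = DATA_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


SAMPLE = [
    "(1)apple",
    "(2)george",
    "(3)bad (one)",
    "(-5) lots of spaces in here! ((()))",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(parse_line(line))
```

Parentheses inside the name are fine here for the same reason they are in the grammar: only the first "(#)" is treated as the number.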
-Rob
On Sun, Feb 1, 2009 at 9:13 AM, Curt Carpenter
<Curt.Carpenter at microsoft.com> wrote:
> Hi Rob, thanks for the reply. I was beginning to wonder whether my email went off into the ether.
>
> The problem with making PARENNUM a parse rule is that it ALSO matches NAME. In fact it would ONLY match NAME, and never the parse rule, since (0) is also a valid NAME (a lex token). So what I finally came up with (not sure this is exactly what I said in my follow-up) is a bunch of different tokens, matching the combinations that a name can appear in:
> ID_NAME_COLON : PARENNUM ('!'..'\u00FE')+ ':'; // (0)curtc:
> ID_NAME : PARENNUM ('!'..'\u00FE')+ ; // (0)curtc
> NAME_COLON : ('!'..'\u00FE')+ ':'; // curtc:
> PARENNUM : '(' NUMBER ')'; // (0)
> NUMBER : '-'? ('0'..'9')+; // 0
> fragment NAME : ('!'..'\u00FE')+; // curtc
> WS : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
>
> Then, in the parse rules, whichever token matches tells me which portion is the name and whether there is an id: each token gets its own target-language action, and cracking the token text apart is trivial. Now that I think I've finally wrapped my head around it, I'm pretty sure what I have is reasonable (it passes all 250,000+ existing source files), given that I can't change the input. But if you think there's a much better way, I'd still be interested to hear it, even though I'm not blocked.
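
The "cracking" step described above might look roughly like the following. This is a sketch in Python rather than Curt's actual target-language actions, which are not shown in the thread; the function name `crack` and the `(id, name, has_colon)` tuple are assumptions made for illustration. It relies only on the token shapes in the quoted rules: an optional leading "(NUMBER)" and an optional trailing ':'.

```python
def crack(token_type, text):
    """Split a composite token's text into (id, name, has_colon).

    token_type is one of "ID_NAME_COLON", "ID_NAME" or "NAME_COLON",
    matching the lexer rule names in the quoted message.
    """
    ident = None
    has_colon = token_type in ("ID_NAME_COLON", "NAME_COLON")
    if has_colon:
        text = text[:-1]  # drop the trailing ':'
    if token_type in ("ID_NAME_COLON", "ID_NAME"):
        close = text.index(")")  # first ')' ends the "(NUMBER)" prefix
        ident = int(text[1:close])
        text = text[close + 1:]
    return ident, text, has_colon


if __name__ == "__main__":
    print(crack("ID_NAME_COLON", "(0)curtc:"))
    print(crack("ID_NAME", "(0)curtc"))
    print(crack("NAME_COLON", "curtc:"))
```

Because the token type already says which pieces are present, no second round of pattern matching is needed on the text.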
>
> Thanks,
>
> Curt
>
> -----Original Message-----
> From: Rob Greene
> Sent: Saturday, January 31, 2009 10:50 AM
>
> Hi Curt --
>
> First off, I'm a total hack at ANTLR. And that's in the golfing sense
> with grass flying everywhere and not in the geek sense of being
> totally awesome. Awesome is reserved for others on this list. So, if
> I mislead you, sorry in advance!!
>
> Anyway, from your description, I think PARENNUM isn't a lexer rule but a
> grammar rule.
>
> So you'd have something like:
>
> language: data+ ;
> data: '(' NUMBER ')' NAME ;
>
> and the same lex rules for NUMBER and NAME as before. Also, don't
> forget a whitespace rule (I assume you need one?)
>
> WS: (' ' | '\t' | '\n' | '\r')+ {skip();} ;
>
> I hope I'm leading you in the right direction at least!
> -Rob
>
> On Thu, Jan 29, 2009 at 10:14 AM, Curt Carpenter
> <Curt.Carpenter at microsoft.com> wrote:
>> Hi all, I am 1 day in on ANTLR, so be gentle. :)
>>
>>
>>
>> I have gone through the tutorials and such, and have created a grammar from
>> scratch, debugged it and have it mostly working, except for one problem. I
>> want to parse something like this:
>>
>> (#)name
>>
>> Where # is a number, but name can be virtually anything except space. I
>> think. I don't own the language, so please don't suggest that name should be
>> further restricted. So I defined the lex rules as so:
>>
>> PARENNUM : '(' NUMBER ')';
>>
>> NUMBER : '-'? ('0'..'9')+;
>>
>> NAME : ('!'..'\u00FE')+; // ansi only
>>
>>
>>
>> You can see the problem at NAME. (0)curt is a valid name. But what I really
>> want is to parse as PARENNUM=(0) NAME=curt. I have a parse rule to match
>> that. But, the lex rules match longest string first, so (0)curt is always
>> tokenized as NAME. Is there any way to change the priority of matching lex
>> tokens to be the order they're defined, rather than order only breaking ties
>> in string length?
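
The behaviour described above can be imitated outside ANTLR with a small Python sketch. It is a generic longest-match tokenizer in which definition order only breaks ties, written to illustrate the symptom; it is not ANTLR's actual lexer algorithm, and `RULES` and `next_token` are names made up for this example. The three patterns are regex transcriptions of the rules quoted above.

```python
import re

# Lexer rules in definition order, transcribed from the message above.
RULES = [
    ("PARENNUM", re.compile(r"\(-?[0-9]+\)")),
    ("NUMBER", re.compile(r"-?[0-9]+")),
    ("NAME", re.compile(r"[!-\xfe]+")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at pos; earlier rules win ties."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so on equal
        # length the rule defined first is kept.
        if match and (best is None or match.end() > best[1].end()):
            best = (name, match)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[0], best[1].group()


if __name__ == "__main__":
    print(next_token("(0)curt"))  # NAME swallows the whole string
    print(next_token("(0)"))      # tie in length, so PARENNUM wins
```

With "(0)curt" the NAME pattern matches seven characters against three for PARENNUM, so ordering never gets a chance to matter.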
>>
>>
>>
>> Or is there some other way to accomplish the simple parse rule I'm trying to
>> solve?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Curt
>