[antlr-interest] Newbie question about lex token matching priority

Curt Carpenter Curt.Carpenter at microsoft.com
Mon Feb 2 07:25:28 PST 2009


Note that ' ' is expressly not allowed in a name. Unfortunately, a name only sometimes has (#) before it, and sometimes has a colon after it; see my lexer rules. These names (with or without ids and/or a colon) also share a line with other elements. I only mentioned the part that was not straightforward. I guess you could make more parse rules like the one you made, using the idea of a lex rule for a single character (and don't forget to allow NUMBER in your name rule). The downside is that the token stream gets much longer, with a token for every character in every name. That seems like a clear disadvantage, and I'm less clear on what the upside of moving paren matching to a parser rule would be.

BTW, by my understanding, you needn't complicate the CHAR lex rule to get what you're after (excluding the parens). Rather, you can simply put the paren rules ahead of the CHAR rule. This was a source of frustration for me until I finally internalized it.
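
As a rough illustration of both points (rule order and token count), here is a small Python model of a lexer that takes the longest match and lets definition order break ties. It is only a sketch: ANTLR's generated lexer decides by lookahead prediction rather than by this loop, the `tokenize` helper and the regexes are hypothetical stand-ins for the rules in this thread, and CHAR is deliberately left broad enough to include the parens.

```python
import re

# Hypothetical stand-ins for the lexer rules, in definition order.
# CHAR is left broad ('!'..'\u00fe', parens included); LPAREN and RPAREN
# win only because they are listed ahead of it.
RULES = [
    ("NUMBER", re.compile(r"-?[0-9]+")),
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
    ("CHAR", re.compile(r"[!-\u00fe]")),
]


def tokenize(text, rules=RULES):
    """Longest match wins; on a tie, the rule defined first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize("(0)curtc")
print(tokens[:3])
print(len(tokens))
```

Under this model, (0)curtc comes out as LPAREN, NUMBER, RPAREN and then five CHAR tokens, eight in all, where a single ID_NAME token would cover the same text. Moving CHAR to the front of the list would make it claim the parens instead.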

BTW #2: I just noticed that in my previous email's lex rules, the fragment NAME is obsolete. It doesn't work, and I stopped using it for the reasons I already explained.
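
For readers following along, the "cracking" of composite tokens mentioned in the quoted message below might look roughly like this. It is a Python sketch, not the actual target-language actions (which are not shown in this thread); the `crack` helper, its regex, and its return shape are all hypothetical. It also infers the pieces from the token text alone, whereas a real action would already know which token type matched.

```python
import re

# Mirrors the shape of ID_NAME_COLON / ID_NAME / NAME_COLON:
# optional "(number)" prefix, a name of non-space characters, optional ':'.
TOKEN_SHAPE = re.compile(r"(?:\((-?[0-9]+)\))?([!-\u00fe]+?)(:?)")


def crack(token_text):
    """Split a composite token's text into (id or None, name, has_colon)."""
    m = TOKEN_SHAPE.fullmatch(token_text)
    if m is None:
        raise ValueError(f"not a name token: {token_text!r}")
    ident, name, colon = m.groups()
    return (int(ident) if ident is not None else None, name, colon == ":")


for text in ["(0)curtc:", "(0)curtc", "curtc:"]:
    print(text, crack(text))
```

The lazy `+?` on the name lets a trailing colon fall into its own group rather than being swallowed by the name, since ':' is itself inside the '!'..'\u00FE' range.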

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Rob Greene
Sent: Monday, February 02, 2009 9:26 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Newbie question about lex token matching priority

Ok, how about this?  The problem I had was I'd define the '(' and ')'
as a lexer rule and then try to use it as such in the grammar.  I had
to break it out --

/* Parsing input such as "(#)name" */
grammar CurtCarpenter;

language:       data+ ;
data    :       LPAREN NUMBER RPAREN name NEWLINE ;
name    :       (CHAR | LPAREN | RPAREN)+ ;

NUMBER  :       '-'? '0'..'9'+ ;
CHAR    :       ' ' | '!'..'\'' | '*'..'\u00fe' ;
NEWLINE :       '\r'? '\n';
LPAREN  :       '(';
RPAREN  :       ')';

Is that close to what you need?  It seemed to work in ANTLRWorks and I
checked it against this input:

(1)apple
(2)george
(3)bad (one)
(-5) lots of spaces in here! ((()))

-Rob

On Sun, Feb 1, 2009 at 9:13 AM, Curt Carpenter
<Curt.Carpenter at microsoft.com> wrote:
> Hi Rob, thanks for the reply. I was beginning to wonder whether my email went off into the ether.
>
> The problem with making PARENNUM a parse rule is that it ALSO matches NAME. In fact it would ONLY match NAME, and never the parse rule, since (0) is also a valid NAME (a lex token). So what I finally came up with (not sure this is exactly what I said in my follow-up) is a bunch of different tokens, matching the combinations that a name can appear in:
> ID_NAME_COLON : PARENNUM ('!'..'\u00FE')+ ':' ;   // (0)curtc:
> ID_NAME       : PARENNUM ('!'..'\u00FE')+ ;       // (0)curtc
> NAME_COLON    : ('!'..'\u00FE')+ ':' ;            // curtc:
> PARENNUM      : '(' NUMBER ')' ;                  // (0)
> NUMBER        : '-'? ('0'..'9')+ ;                // 0
> fragment NAME : ('!'..'\u00FE')+ ;                // curtc
> WS            : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
>
> Then in the parse rules, whichever token matches tells me which portion is the name and whether there is an id: each token gets its own target-language action, and cracking the text apart is trivial. Now that I think I've finally wrapped my head around it, I'm pretty sure what I have is reasonable (it passes all 250,000+ existing source files), given that I can't change the input. But if you think there's a much better way, I'd still be interested to hear it, even though I'm not blocked.
>
> Thanks,
>
> Curt
>
> -----Original Message-----
> From: Rob Greene
> Sent: Saturday, January 31, 2009 10:50 AM
>
> Hi Curt --
>
> First off, I'm a total hack at ANTLR.  And that's in the golfing sense
> with grass flying everywhere and not in the geek sense of being
> totally awesome.  Awesome is reserved for others on this list.  So, if
> I mislead you, sorry in advance!!
>
> Anyway, from your description, I think PARENNUM isn't a lexer rule but a
> parser rule.
>
> So you'd have something like:
>
> language: data+ ;
> data: '(' NUMBER ')' NAME ;
>
> and the same lex rules for NUMBER and NAME as before.  Also, don't
> forget a whitespace rule (I assume you need one?)
>
> WS: (' ' | '\t' | '\n' | '\r')+ {skip();} ;
>
> I hope I'm leading you in the right direction at least!
> -Rob
>
> On Thu, Jan 29, 2009 at 10:14 AM, Curt Carpenter
> <Curt.Carpenter at microsoft.com> wrote:
>> Hi all, I am 1 day in on ANTLR, so be gentle. :)
>>
>> I have gone through the tutorials and such, and have created a grammar from
>> scratch, debugged it and have it mostly working, except for one problem. I
>> want to parse something like this:
>>
>> (#)name
>>
>> Where # is a number, but name can be virtually anything except space. I
>> think. I don't own the language, so please don't suggest that name should be
>> further restricted. So I defined the lex rules as so:
>>
>> PARENNUM : '(' NUMBER ')';
>> NUMBER   : '-'? ('0'..'9')+;
>> NAME     : ('!'..'\u00FE')+; // ansi only
>>
>> You can see the problem at NAME. (0)curt is a valid name. But what I really
>> want is to parse as PARENNUM=(0) NAME=curt. I have a parse rule to match
>> that. But, the lex rules match longest string first, so (0)curt is always
>> tokenized as NAME. Is there any way to change the priority of matching lex
>> tokens to be the order they're defined, rather than order only breaking ties
>> in string length?
>>
>> Or is there some other way to accomplish the simple parse rule I'm trying to
>> solve?
>>
>> Thanks,
>>
>> Curt
>
