[antlr-interest] Newbie question about lex token matching priority

Curt Carpenter Curt.Carpenter at microsoft.com
Thu Jan 29 09:41:52 PST 2009


After taking a minute to write the previous email, I got an idea, tried it, and it appears to work. But first, I forgot to mention another snag in my problem. I also sometimes want to match (#)name: where # and name are variable, but the ( ) and : are all literal. Because of the problem described below, the : always gets absorbed into name. My solution for all of these problems is to simply add tokens for all the possible combinations, and I'm looking for advice on whether this is reasonable or whether there is a better alternative. Like so:
ID_NAME_COLON : PLAYERNUM NAME ':';
ID_NAME       : PLAYERNUM NAME;
NAME_COLON    : NAME ':';
PLAYERNUM     : '(' NUMBER ')';
NUMBER        : '-'? ('0'..'9')+;
fragment NAME : ('!'..'\u00FE')+;

Now I get a single token for (0)curt or (0)curt: or even just curt: based on the longest-match rule, and that's ok because in the parser rules I know which of the tokens I'm looking for. Then in a target action I can use a simple regex to crack the name and number apart, given that I know which token type it is.
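
For illustration, the "simple regex to crack the name and number" step might look roughly like the sketch below. It is written in Python only for brevity; the helper name crack and the pattern table are hypothetical, not part of ANTLR or of the grammar above, and a real target action would do the equivalent in whatever language the grammar targets. The patterns assume the token text has exactly the shape of the composite rules.

```python
import re

# Hypothetical lookup table: one pattern per composite token type.
# The character class [!-\u00FE] mirrors the ('!'..'\u00FE')+ range in the grammar.
_PATTERNS = {
    "ID_NAME_COLON": re.compile(r"\((?P<num>-?[0-9]+)\)(?P<name>[!-\u00FE]+):"),
    "ID_NAME": re.compile(r"\((?P<num>-?[0-9]+)\)(?P<name>[!-\u00FE]+)"),
    "NAME_COLON": re.compile(r"(?P<name>[!-\u00FE]+):"),
}


def crack(token_type, text):
    """Split a composite token's text into (number or None, name)."""
    match = _PATTERNS[token_type].fullmatch(text)
    if match is None:
        raise ValueError(f"{text!r} does not look like a {token_type} token")
    groups = match.groupdict()
    num = groups.get("num")  # absent for NAME_COLON
    return (int(num) if num is not None else None, groups["name"])
```

Because the token type is already known from the parser rule, each pattern only has to strip the literal ( ) and : around the variable parts; the trailing colon is given back by regex backtracking, so a name that itself contains a colon keeps everything except the last one.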

Ok, I slightly lied. The fragment NAME isn't working. I thought fragment was semantically the same as inlining, but it's not. I have to do the manual replacement, otherwise the lexer will try to match NAME using the single rule for NAME, which will consume the trailing literals I need to match the composite tokens. So what I really have is this:
ID_NAME_COLON : PLAYERNUM ('!'..'\u00FE')+ ':';
ID_NAME       : PLAYERNUM ('!'..'\u00FE')+;
NAME_COLON    : ('!'..'\u00FE')+ ':';
PLAYERNUM     : '(' NUMBER ')';
NUMBER        : '-'? ('0'..'9')+;

This does appear to work. Is this a normal thing to do, or am I off in the weeds?

Thanks again,

Curt

From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Curt Carpenter
Sent: Friday, January 30, 2009 12:15 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Newbie question about lex token matching priority

Hi all, I am 1 day in on ANTLR, so be gentle. :)

I have gone through the tutorials and such, and have created a grammar from scratch, debugged it and have it mostly working, except for one problem. I want to parse something like this:
(#)name
Where # is a number, but name can be virtually anything except a space (I think). I don't own the language, so please don't suggest that name should be further restricted. So I defined the lexer rules like so:
PARENNUM : '(' NUMBER ')';
NUMBER   : '-'? ('0'..'9')+;
NAME     : ('!'..'\u00FE')+; // ansi only

You can see the problem at NAME. (0)curt is itself a valid NAME. But what I really want is to tokenize it as PARENNUM=(0) NAME=curt, and I have a parser rule to match that. But the lexer rules match the longest string first, so (0)curt is always tokenized as NAME. Is there any way to change the priority of matching lexer tokens to be the order they're defined, rather than having order only break ties in match length?

Or is there some other way to accomplish the simple parse rule I'm trying to solve?
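
To make the behavior concrete, here is a toy longest-match tokenizer in Python. It is only a sketch of the "longest string wins, definition order breaks ties" rule as described above, not a model of how ANTLR's generated lexer actually decides; the RULES table and tokenize function are made up for this illustration, and skipping spaces between tokens is an added assumption.

```python
import re

# The three lexer rules from above, in definition order, as regexes.
RULES = [
    ("PARENNUM", re.compile(r"\(-?[0-9]+\)")),
    ("NUMBER", re.compile(r"-?[0-9]+")),
    ("NAME", re.compile(r"[!-\u00FE]+")),
]


def tokenize(text):
    """Return (rule, text) pairs, always preferring the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":  # assumption: spaces only separate tokens
            pos += 1
            continue
        best = None  # (rule name, end offset) of the longest match so far
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer wins, so an earlier rule keeps a tie.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens
```

With this sketch, tokenize("(0)curt") yields a single NAME token, because NAME matches seven characters while PARENNUM matches only three. For tokenize("(0) curt") the two rules tie on "(0)", so definition order picks PARENNUM, and curt follows as a separate NAME.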

Thanks,

Curt