[antlr-interest] Newbie question about lex token matching priority

Sun Feb 1 07:13:39 PST 2009

Hi Rob, thanks for the reply. I was beginning to wonder whether my email went off into the ether.

The problem with making PARENNUM a parse rule is that the same input ALSO matches NAME. In fact the lexer would ONLY ever produce NAME, and the parse rule would never match, since (0) is also a valid NAME (a lex token). So what I finally came up with (not sure this is exactly what I said in my follow-up) is a bunch of different tokens, one for each combination that a name can appear in:
ID_NAME_COLON	: PARENNUM ('!'..'\u00FE')+ ':';	// (0)curtc:
ID_NAME		: PARENNUM ('!'..'\u00FE')+ ;		// (0)curtc
NAME_COLON		: ('!'..'\u00FE')+ ':';			// curtc:
PARENNUM 		: '(' NUMBER ')';				// (0)
NUMBER 		: '-'? ('0'..'9')+;			// 0
fragment NAME	: ('!'..'\u00FE')+;			// curtc
WS 			: ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
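As a rough sanity check on which inputs each token is meant to cover, the rules above can be modeled with regular expressions. This is only a sketch: it assumes a plain "longest match wins, ties go to the rule defined first" model with regex backtracking, which is a simplification and not how ANTLR actually predicts tokens. The RULES table and the classify helper are hypothetical names made up for this illustration.

```python
import re

# Regex stand-ins for the lexer rules above, in definition order.
# NAME is a fragment, so it never becomes a token on its own.
NAME_CHARS = "[!-\u00fe]+"      # ('!'..'\u00FE')+
PARENNUM = r"\(-?[0-9]+\)"      # '(' NUMBER ')'
RULES = [
    ("ID_NAME_COLON", PARENNUM + NAME_CHARS + ":"),
    ("ID_NAME", PARENNUM + NAME_CHARS),
    ("NAME_COLON", NAME_CHARS + ":"),
    ("PARENNUM", PARENNUM),
    ("NUMBER", r"-?[0-9]+"),
]


def classify(text):
    """Return (rule_name, lexeme) for the longest match at the start of text.

    Ties go to the rule defined first. Returns None if no rule matches.
    This is a simplified model, not ANTLR's real algorithm.
    """
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        # Strictly longer only, so an earlier rule keeps a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


for sample in ["(0)curtc:", "(0)curtc", "curtc:", "(0)", "0"]:
    print(sample, "->", classify(sample))
```

Under this model a bare name such as curtc, with no id and no colon, matches no token at all, because NAME is only a fragment.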

Then in parse rules, given whichever token matches, I can know which portion is the name, and whether there is an id, by using a different target language action for each separate token and trivially cracking the token text apart. Now that I think I've finally wrapped my head around it, I'm pretty sure what I have is reasonable (it passes all 250,000+ existing source files), given that I can't change the input. But if you think there's a much better way, I'd still be interested to hear it, even though I'm not blocked.
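The per-token "cracking" step might look something like the sketch below. It is written in Python purely for illustration; the thread does not say which target language the real actions use, and the crack function and its return shape are hypothetical. It assumes the token text already has the shape guaranteed by the corresponding lexer rule.

```python
import re

# Matches "(<number>)<name>" and captures the number and the name.
_ID_NAME = re.compile(r"\((-?[0-9]+)\)(.+)", re.DOTALL)


def crack(token_type, text):
    """Split a token's text into (id, name); id is None when absent.

    Hypothetical helper: token_type is the name of the lexer rule that
    matched, text is the matched lexeme.
    """
    if token_type in ("ID_NAME_COLON", "NAME_COLON"):
        text = text[:-1]  # drop the trailing ':'
    if token_type in ("ID_NAME_COLON", "ID_NAME"):
        m = _ID_NAME.fullmatch(text)
        if m is None:
            raise ValueError("text does not look like (id)name: %r" % text)
        return int(m.group(1)), m.group(2)
    if token_type == "NAME_COLON":
        return None, text
    raise ValueError("token type carries no name: %r" % token_type)


print(crack("ID_NAME_COLON", "(0)curtc:"))
print(crack("ID_NAME", "(0)curtc"))
print(crack("NAME_COLON", "curtc:"))
```

Because the first parenthesized group is anchored at the start of the text, a name that itself contains parentheses stays intact in the name portion.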

Thanks,

Curt

-----Original Message-----
From: Rob Greene
Sent: Saturday, January 31, 2009 10:50 AM

Hi Curt --

First off, I'm a total hack at ANTLR.  And that's in the golfing sense
with grass flying everywhere and not in the geek sense of being
totally awesome.  Awesome is reserved for others on this list.  So, if
I mislead you, sorry in advance!!

Anyway, from your description, I think PARENNUM should be a parser
rule rather than a lexer rule.

So you'd have something like:

language: data+ ;
data: '(' NUMBER ')' NAME ;

and the same lex rules for NUMBER and NAME as before.  Also, don't
forget a whitespace rule (I assume you need one?)

WS: (' ' | '\t' | '\n' | '\r')+ {skip();} ;

I hope I'm leading you in the right direction at least!
-Rob

On Thu, Jan 29, 2009 at 10:14 AM, Curt Carpenter
<Curt.Carpenter at microsoft.com> wrote:
> Hi all, I am 1 day in on ANTLR, so be gentle. :)
>
> I have gone through the tutorials and such, and have created a grammar from
> scratch, debugged it and have it mostly working, except for one problem. I
> want to parse something like this:
>
> (#)name
>
> Where # is a number, but name can be virtually anything except space. I
> think. I don't own the language, so please don't suggest that name should be
> further restricted. So I defined the lex rules like so:
>
> PARENNUM : '(' NUMBER ')';
> NUMBER   : '-'? ('0'..'9')+;
> NAME     : ('!'..'\u00FE')+; // ansi only
>
> You can see the problem at NAME. (0)curt is a valid name. But what I really
> want is to parse as PARENNUM=(0) NAME=curt. I have a parse rule to match
> that. But, the lex rules match longest string first, so (0)curt is always
> tokenized as NAME. Is there any way to change the priority of matching lex
> tokens to be the order they're defined, rather than order only breaking ties
> in string length?
>
> Or is there some other way to accomplish the simple parse rule I'm trying to
> solve?
>
> Thanks,
>
> Curt