[antlr-interest] Understanding Lexer rules

Wed Feb 20 00:17:06 PST 2008

At 09:44 20/02/2008, Darien Hager wrote:
>1) It helps to consider lexer rules (token definitions) to be 
>separate from the parser rules, even though they're in the same 
>file.

Yes.  In fact my opinion is that it's best not to use character 
literals in the parser at all (since this helps to reinforce the 
separation between lexer and parser).  There was a big discussion 
on this last week.

>2) Unlike parser rules, the order of appearance matters. (The 
>auto-named tokens generated by literals in parser rules are 
>appended.)

I believe they're prepended, actually.  And the order only sort-of 
matters; a rule that consumes more input will usually win against 
one that consumes less input regardless of order.
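
To illustrate the "more input usually wins" point, here is a toy 
longest-match scanner in Python.  It is only a simplified model, 
not ANTLR's actual prediction algorithm, and the names (tokenize, 
IF, ID) are made up for the example.  It assumes the longest match 
wins and that rule order only breaks ties.

```python
import re

def tokenize(text, rules):
    """Toy scanner: longest match wins; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hypothetical rules: a keyword and a general identifier.
rules = [("IF", r"if"), ("ID", r"[a-z]+")]
reversed_rules = list(reversed(rules))

# ID consumes more of "iffy" than IF does, so it wins in either order.
print(tokenize("iffy", rules))
print(tokenize("iffy", reversed_rules))

# On "if" both rules consume two characters, so order decides.
print(tokenize("if", rules))
print(tokenize("if", reversed_rules))
```

In this model, order only changes the result for the input "if", 
where both rules consume the same amount of text.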

>3) The lexer seeks to match the first viable token.

Sort of (see above).

>4) Order your tokens from most specific and complex to least 
>specific and generic.
>5) Ensure that any lexer rules which are only for convenience 
>(and not as fully valid first-class tokens on their own) are 
>marked as "fragment" rules.

Yes.

>So, for example, I'd put NUMERIC (the specific case) before 
>ALPHANUMERIC in the lexer rules.

I'm not entirely sure whether input such as "42foo" will resolve 
to ALPHANUMERIC or NUMERIC ALPHANUMERIC (it probably depends on 
how the rules are defined).  Either way, you need to word your 
parser rules carefully if NUMERIC is a complete subset of 
ALPHANUMERIC.
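
As a rough way to see why the answer depends on the matching 
strategy, the Python sketch below scans "42foo" twice: once taking 
the longest match, once taking the first rule that matches at all.  
The regular expressions for NUMERIC and ALPHANUMERIC are my own 
guesses at the definitions, and neither strategy is claimed to be 
exactly what ANTLR generates.

```python
import re

# Assumed definitions; the real grammar's rules may differ.
RULES = [("NUMERIC", r"[0-9]+"), ("ALPHANUMERIC", r"[0-9A-Za-z]+")]

def scan(text, rules, longest):
    """Scan text with either a longest-match or a first-match policy."""
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        if longest:
            # max() keeps the first of equally long candidates,
            # so rule order still breaks ties.
            name, lexeme = max(candidates, key=lambda c: len(c[1]))
        else:
            # First viable rule in definition order.
            name, lexeme = candidates[0]
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

print(scan("42foo", RULES, longest=True))   # one ALPHANUMERIC token
print(scan("42foo", RULES, longest=False))  # NUMERIC, then ALPHANUMERIC
print(scan("42", RULES, longest=True))      # tie, so NUMERIC (listed first)
```

Either way, a parser rule that accepts ALPHANUMERIC probably needs 
to accept NUMERIC as well wherever plain digits are legal input.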