[antlr-interest] question about lexer rules

Sat Dec 19 17:33:34 PST 2009

At 22:24 19/12/2009, codeman at bytefusion.de wrote:
 >A question to lexer rules and its priorities. Is there any
 >dependency between order of lexer rule definitions?
[...]
 >My understanding of lexer rules is, the best rule will
 >match. The best rule is the rule matching the most
 >characters. But what about TIME and IDENTIFIER_LOWER? Both
 >may match the same input sequence.

Both are true.  In general, the best match will win.  But in cases 
where two rules can match the same input, then the one listed 
first will win.
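
As a small illustration of that tie-break, here is a sketch in 
ANTLR 3 syntax.  The IF and ID rules are hypothetical and are not 
taken from your grammar; they only show two rules that can match 
the same text.

```
// Hypothetical rules, ANTLR 3 syntax.
// Both rules can match the input "if".  IF is listed first, so
// "if" should come out as an IF token.  For "iffy" only ID can
// match the whole word, so the longer match should win there.
IF : 'if' ;
ID : ('a'..'z')+ ;
```

Swapping the order of the two rules would be expected to turn 
"if" into an ordinary ID.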

There are also some complications related to how ANTLR generates 
the lookahead code; it stops looking ahead once it has seen 
enough input to eliminate all other rules, which is sometimes 
early enough to get it into trouble with certain kinds of input 
(hence the trouble with INT vs. FLOAT tokens discussed here 
repeatedly).

I think in your case it'll be ok, but it's possible that ANTLR 
might get into trouble with certain kinds of input -- for example, 
"12h53" might be seen as a malformed TIME rather than a TIME 
followed by a NUMBER.

There are some problems in that grammar, though.

1. The DIGIT, LOWERCASE, and UPPERCASE rules should almost 
certainly be marked as fragment rules, since you don't really 
want to get individual DIGIT or LOWERCASE tokens in the parser.

2. The IDENTIFIER_UPPER rule should use + instead of *; using * 
means that a valid IDENTIFIER_UPPER can contain zero characters, 
which can mean that ANTLR will get into an infinite loop producing 
IDENTIFIER_UPPER tokens without consuming any input.  In general, 
no top-level lexer rule should ever permit zero consumption.

3. You have both a NEWLINE and a WS rule matching the same 
characters, one skipped and one not skipped.  If newlines are 
significant to the parser then you should remove them from the WS 
rule; if they're not then you should remove the NEWLINE rule, or 
make it a fragment.

4. Your two identifier rules specify that identifiers cannot 
contain digits, nor can they be mixed-case.  Is this actually what 
you wanted?

5. In the TIME rule, you are using + in a very bizarre 
way.  Remember, it denotes repetition, not concatenation.  Are you 
really trying to say that "12hhhhhh25mmm" is a valid TIME?

6. You should left-factor the TIME rule, so that all of the 
alternatives with a common left prefix are expressed together 
(i.e. have the common left prefix followed by optional 
alternatives).  This reduces the amount of lookahead ANTLR 
requires, improves performance, and helps to reduce ambiguous 
cases.
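
To make points 1, 2, 3, 5 and 6 concrete, here is a rough sketch 
of how the lexer rules might look afterwards, in ANTLR 3 syntax.  
It is not a drop-in replacement.  The grammar name, the NUMBER 
rule body, and the TIME format (hours, then optionally minutes, 
as in "12h" or "12h25m") are my assumptions.  It also assumes 
that newlines are significant to the parser, and it leaves the 
question in point 4 open by keeping the identifiers letters-only.

```
lexer grammar TimeSketch;   // hypothetical name

// Point 1: helper rules are fragments.  Other lexer rules can
// use them, but they are never sent to the parser as tokens.
fragment DIGIT     : '0'..'9' ;
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;

// Points 5 and 6: writing elements side by side already means
// concatenation, so no '+' is needed between them.  The common
// prefix DIGIT+ 'h' appears once, followed by an optional tail.
// Assumed format: "12h" or "12h25m".
TIME : DIGIT+ 'h' ( DIGIT+ 'm' )? ;

NUMBER : DIGIT+ ;

// Point 2: '+' rather than '*', so every token consumes at
// least one character.
IDENTIFIER_LOWER : LOWERCASE+ ;
IDENTIFIER_UPPER : UPPERCASE+ ;

// Point 3: newlines are assumed significant here, so NEWLINE is
// a real token and WS no longer matches line ends.
NEWLINE : '\r'? '\n' ;
WS      : ( ' ' | '\t' )+ { $channel = HIDDEN; } ;
```

With a TIME rule shaped like this, input such as "12h53" may 
still be reported as a malformed TIME, for the lookahead reasons 
described earlier.  If newlines are not significant, the 
alternative is to drop NEWLINE and add '\r' and '\n' back into 
WS.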