[antlr-interest] overlapping lexer rules

Wed Nov 7 20:39:33 PST 2007

The second example is technically valid because of the undocumented fact 
that the *ORDER* of lexer rules is also significant.

For the second example, ANTLR creates a DFA (deterministic finite 
automaton) that looks ahead as far as it takes to find which token rule 
matches. If more than one rule matches, it chooses the one listed first 
in the grammar. This is why the INT rule matches "12" even though the 
FLOAT rule also matches.

If you were to swap the order of INT and FLOAT in the grammar, "12" 
would be matched as a FLOAT, and no input would ever produce an INT 
token. (Tested using ANTLR v3.0.1 and ANTLRWorks v1.1.4)
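
To make the tie-breaking concrete, here is a rough Python model of the 
behaviour described above. It is only a sketch: ANTLR builds a DFA, it 
does not try regexes one by one, and the names RULES and first_token are 
made up for this illustration. The model assumes the longest match wins 
and that a tie goes to the rule listed first, which agrees with the 
results above for these inputs.

```python
import re

# Python regex stand-ins for the two rules in the book's second example.
RULES = {
    "INT": r"[0-9]+",
    "FLOAT": r"[0-9]+(\.[0-9]*)?",
}

def first_token(text, order):
    """Return (rule_name, lexeme) for the token at the start of text.

    Simplified model (an assumption, not ANTLR's real algorithm): the
    longest match wins, and when several rules match the same text the
    rule that comes first in `order` wins.
    """
    best = None
    for name in order:
        m = re.match(RULES[name], text)
        # Replace the current best only on a strictly longer match,
        # so an earlier rule keeps a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(first_token("12", ["INT", "FLOAT"]))    # ('INT', '12')
print(first_token("12", ["FLOAT", "INT"]))    # ('FLOAT', '12')
print(first_token("12.5", ["INT", "FLOAT"]))  # ('FLOAT', '12.5')
```

With FLOAT listed first, nothing in this model can ever come back as 
INT, which mirrors the swapped-order result.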

It is worth noting that this is an incredibly poor example of how to 
define a float token. With INT already covering plain digit strings, a 
float token should not be able to exist without a decimal point ('.'). 
The following would create a more efficient look-ahead DFA:

INT  : '0'..'9'+;
FLOAT: '0'..'9'+ '.' '0'..'9'*;
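
As a hand-written illustration of the look-ahead decision for the two 
rules above (this is not code that ANTLR generates, and the function 
name predict is hypothetical): skip the run of digits, then peek at one 
more character.

```python
def predict(text, pos=0):
    """Say which rule applies at pos, scanning ahead without consuming input.

    Hand-written sketch for:
        INT  : '0'..'9'+ ;
        FLOAT: '0'..'9'+ '.' '0'..'9'* ;
    """
    i = pos
    # Both rules start with one or more ASCII digits, so skip past them.
    while i < len(text) and "0" <= text[i] <= "9":
        i += 1
    if i == pos:
        raise ValueError("neither INT nor FLOAT can start here")
    # The character right after the digits settles the choice.
    if i < len(text) and text[i] == ".":
        return "FLOAT"
    return "INT"

print(predict("12"))    # INT
print(predict("12.5"))  # FLOAT
print(predict("12."))   # FLOAT
```

Because FLOAT now requires the '.', the choice never depends on which 
rule is listed first.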

Also, neither version allows a float that starts with a decimal point 
instead of a digit. The following does:

INT  : '0'..'9'+;
FLOAT: '0'..'9'+ '.' '0'..'9'*
      | '.' '0'..'9'+;

The preceding also eliminates any dependency on grammar order. You can 
swap the order of these two rules and the input will still be tokenized 
correctly.
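
One way to convince yourself of the order independence is to check that 
no lexeme is matched in full by both rules, so there is never a tie for 
rule order to break. The sketch below does that with Python regexes 
standing in for the two rules above; RULES and matching_rules are names 
invented for this example, and the sample inputs are only spot checks, 
not a proof.

```python
import re

# Python regex stand-ins for the revised INT and FLOAT rules.
RULES = {
    "INT": r"[0-9]+",
    "FLOAT": r"[0-9]+\.[0-9]*|\.[0-9]+",
}

def matching_rules(lexeme):
    """Return the names of all rules that match the whole lexeme."""
    return [name for name, pattern in RULES.items()
            if re.fullmatch(pattern, lexeme)]

for lexeme in ["12", "12.", "12.5", ".5"]:
    print(repr(lexeme), matching_rules(lexeme))
# '12' ['INT']
# '12.' ['FLOAT']
# '12.5' ['FLOAT']
# '.5' ['FLOAT']
```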

As to why this poor example is in the book, and why the effect of rule 
order on the lexer is not appropriately documented - ya got me.

I hope that helps
-- Curtis

cimbroken wrote:
> quoting two examples from the book (pg.280-281):
> 
> 1)
> INT : DIGIT+ ;
> DIGIT : '0'..'9' ;
> 
> 2)
> INT : '0'..'9' +;
> FLOAT : '0'..'9' + ('.' '0'..'9'*)? ;
> 
> I don't understand very well why the second is *not* a mistake. It seems 
> to me that these two examples are similar: 2 "free" rules (not fragment) 
> that try to match different tokens that start with the same character. 
> Why does antlr treat them in different ways?