[antlr-interest] overlapping lexer rules
Curtis Clauson
NOSPAM at TheSnakePitDev.com
Wed Nov 7 20:39:33 PST 2007
The second example is technically valid because of the undocumented fact
that the *ORDER* of lexer rules is also significant.
For the second example, ANTLR creates a DFA (Deterministic Finite Automaton)
to look ahead as far as it takes to find which token rule matches. If
more than one rule matches, it will choose the first one listed in the grammar.
This is why the INT rule will match "12" even though the FLOAT rule also
matches.
If you were to swap the order of INT and FLOAT in the grammar, "12"
would be matched as a FLOAT, and INT would never be matched by anything.
(Tested using ANTLR v3.0.1 and ANTLRWorks v1.1.4)
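
To make the tie-break concrete, here is a rough Python sketch of the
behaviour described above. It is only a toy model (longest match wins, ties
go to the rule listed first), not ANTLR's actual DFA construction, and the
tokenize function and the regex translations of the two rules are my own
stand-ins for illustration.

```python
import re

def tokenize(rules, text):
    """Toy lexer: longest match wins; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule in the list is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hypothetical regex equivalents of the book's second example.
INT = ("INT", r"[0-9]+")
FLOAT = ("FLOAT", r"[0-9]+(\.[0-9]*)?")

int_first = tokenize([INT, FLOAT], "12")    # [('INT', '12')]
float_first = tokenize([FLOAT, INT], "12")  # [('FLOAT', '12')]
print(int_first, float_first)
```

With FLOAT listed first, every run of digits ties or is beaten by FLOAT, so
INT is never chosen in this model either.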
It is worth noting that this is an incredibly poor example of defining
a float token. Because INT wins every tie, a FLOAT token can never be
produced without a decimal point ('.'), so making the decimal part
optional gains nothing. Requiring it creates a more efficient
look-ahead DFA:
INT : '0'..'9'+;
FLOAT: '0'..'9'+ '.' '0'..'9'*;
Also, it does not allow for a float that starts with a decimal point
instead of a digit. The following does:
INT : '0'..'9'+;
FLOAT: '0'..'9'+ '.' '0'..'9'*
| '.' '0'..'9'+;
The preceding also eliminates any dependency on grammar order, because no
input can match both rules. You can swap the order of these two rules and
the input will still be tokenized correctly.
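
As a quick sanity check of that claim, the following Python sketch tests
which of the revised rules matches a complete lexeme. The RULES table and
the matching_rules helper are hypothetical names, and the regular
expressions are my own translation of the grammar above. It only checks
whole strings and says nothing about how ANTLR builds its look-ahead.

```python
import re

# Assumed regex equivalents of the revised INT and FLOAT rules.
RULES = {
    "INT": r"[0-9]+",
    "FLOAT": r"[0-9]+\.[0-9]*|\.[0-9]+",
}

def matching_rules(text):
    """Return the names of all rules that match the entire string."""
    return {name for name, pattern in RULES.items()
            if re.fullmatch(pattern, text)}

for sample in ["12", "12.", "12.5", ".5"]:
    print(sample, matching_rules(sample))
```

No sample is matched by both rules, so there is never a tie for rule order
to break.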
As to why this poor example is in the book, and why the effect of rule
order on the lexer is not appropriately documented - ya got me.
I hope that helps
-- Curtis
cimbroken wrote:
> quoting two examples from the book (pg.280-281):
>
> 1)
> INT : DIGIT+ ;
> DIGIT : '0'..'9' ;
>
> 2)
> INT : '0'..'9' +;
> FLOAT : '0'..'9' + ('.' '0'..'9'*)? ;
>
> I don't understand very well why the second is *not* a mistake. It seems
> to me that these two examples are similar: 2 "free" rules (not fragment)
> that try to match different tokens that start with the same character.
> Why does ANTLR treat them in different ways?