[antlr-interest] Problem when parsing numerics

Gavin Lambert antlr at mirality.co.nz
Fri Feb 20 14:32:43 PST 2009


At 21:54 19/02/2009, Thomas Woelfle wrote:
 >But even when factoring out "." into the DOT lexer rule the 
result
 >is the same.
 >I think my problem is not the parser but the lexer. As far as I
 >know a lexer usually tokenizes an input stream always trying to
 >find the longest valid token. In my case valid tokens are 
NUMERICs
 >and DOTs. Where a NUMERIC can be "5", "5.5", "123.56".
 >"5." is not a NUMERIC. Having that input stream I would have
 >expected the lexer to tokenize it into two tokens NUMERIC "5"
 >and DOT ".".

To understand why it's behaving like that, you need to understand 
the algorithm that ANTLR is currently using for the lexer.

Basically, whenever it wants to generate a token, it calls the 
mTOKENS method, which is effectively a rule that makes the 
following assumptions:
   - all top-level lexer rules are possible alternatives
   - exactly one token must be matched
   - if none of the alternatives can be matched, then skip one 
character and try again

Coupled with this is a performance criterion that has it examine 
the minimal amount of lookahead it can get away with to 
unambiguously decide between two rules or alternatives.

The upshot of all this is that when faced with the input "5.", the 
mTOKENS rule will always select NUMERIC since it's the only rule 
that allows the input to begin with digits.  Then the NUMERIC rule 
will consume both the digits and the dot because when faced with 
the decision whether to match the optional alt ('.' DIGIT+)? or to 
match nothing, it only uses one character of lookahead (the dot), 
due to the minimal-lookahead rule.

It's basically seeing it as a choice between a consume-nothing 
alternative and a consume-dot alternative, so the obvious 
condition is whether the next character is a dot or not.  It 
doesn't look "past" that and realise that if the dot isn't 
followed by a digit then it can't possibly match the full alt 
successfully.

You can force it to use two characters of lookahead via a 
syntactic predicate, though:
   NUMERIC : DIGIT+ (('.' DIGIT) => '.' DIGIT+)? ;



More information about the antlr-interest mailing list