[antlr-interest] Problem when parsing numerics
Gavin Lambert
antlr at mirality.co.nz
Fri Feb 20 14:32:43 PST 2009
At 21:54 19/02/2009, Thomas Woelfle wrote:
>But even when factoring out "." into the DOT lexer rule the
result
>is the same.
>I think my problem is not the parser but the lexer. As far as I
>know a lexer usually tokenizes an input stream always trying to
>find the longest valid token. In my case valid tokens are
NUMERICs
>and DOTs. Where a NUMERIC can be "5", "5.5", "123.56".
>"5." is not a NUMERIC. Having that input stream I would have
>expected the lexer to tokenize it into two tokens NUMERIC "5"
>and DOT ".".
To understand why it's behaving like that, you need to understand
the algorithm that ANTLR is currently using for the lexer.
Basically, whenever it wants to generate a token, it calls the
mTOKENS method, which is effectively a rule that makes the
following assumptions:
- all top-level lexer rules are possible alternatives
- exactly one token must be matched
- if none of the alternatives can be matched, then skip one
character and try again
Coupled with this is a performance criterion that has it examine
the minimal amount of lookahead it can get away with to
unambiguously decide between two rules or alternatives.
The upshot of all this is that when faced with the input "5.", the
mTOKENS rule will always select NUMERIC since it's the only rule
that allows the input to begin with digits. Then the NUMERIC rule
will consume both the digits and the dot because when faced with
the decision whether to match the optional alt ('.' DIGIT+)? or to
match nothing, it only uses one character of lookahead (the dot),
due to the minimal-lookahead rule.
It's basically seeing it as a choice between a consume-nothing
alternative and a consume-dot alternative, so the obvious
condition is whether the next character is a dot or not. It
doesn't look "past" that and realise that if the dot isn't
followed by a digit then it can't possibly match the full alt
successfully.
You can force it to use two characters of lookahead via a
syntactic predicate, though:
NUMERIC : DIGIT+ (('.' DIGIT) => '.' DIGIT+)? ;
More information about the antlr-interest
mailing list