[antlr-interest] Problem when parsing numerics
Thomas Woelfle
thomas.woelfle at interactive-objects.com
Tue Feb 24 23:39:46 PST 2009
Hi Gavin,
thanks for the tip. Using a syntactic predicate works. But to me this
seems to be a bug in the algorithm that examines the minimal amount of
lookahread since it calculates a wrong minimal lookahead, isn't it?
Regards,
Thomas
> At 21:54 19/02/2009, Thomas Woelfle wrote:
> >But even when factoring out "." into the DOT lexer rule the result
> >is the same.
> >I think my problem is not the parser but the lexer. As far as I
> >know a lexer usually tokenizes an input stream always trying to
> >find the longest valid token. In my case valid tokens are NUMERICs
> >and DOTs. Where a NUMERIC can be "5", "5.5", "123.56".
> >"5." is not a NUMERIC. Having that input stream I would have
> >expected the lexer to tokenize it into two tokens NUMERIC "5"
> >and DOT ".".
>
> To understand why it's behaving like that, you need to understand the
> algorithm that ANTLR is currently using for the lexer.
>
> Basically, whenever it wants to generate a token, it calls the mTOKENS
> method, which is effectively a rule that makes the following assumptions:
> - all top-level lexer rules are possible alternatives
> - exactly one token must be matched
> - if none of the alternatives can be matched, then skip one
> character and try again
>
> Coupled with this is a performance criterion that has it examine the
> minimal amount of lookahead it can get away with to unambiguously
> decide between two rules or alternatives.
>
> The upshot of all this is that when faced with the input "5.", the
> mTOKENS rule will always select NUMERIC since it's the only rule that
> allows the input to begin with digits. Then the NUMERIC rule will
> consume both the digits and the dot because when faced with the
> decision whether to match the optional alt ('.' DIGIT+)? or to match
> nothing, it only uses one character of lookahead (the dot), due to the
> minimal-lookahead rule.
>
> It's basically seeing it as a choice between a consume-nothing
> alternative and a consume-dot alternative, so the obvious condition is
> whether the next character is a dot or not. It doesn't look "past"
> that and realise that if the dot isn't followed by a digit then it
> can't possibly match the full alt successfully.
>
> You can force it to use two characters of lookahead via a syntactic
> predicate, though:
> NUMERIC : DIGIT+ (('.' DIGIT) => '.' DIGIT+)? ;
>
--
Interactive Objects Software GmbH
Basler Strasse 61
79100 Freiburg, Germany
Phone: +49 761 400 73 0
mailto:thomas.woelfle at interactive-objects.com
------------------------------------------------------------------------
Interactive Objects' Legacy Modernization Solutions
Get Your Applications SOA-Ready!
See http://www.interactive-objects.com/ for more information.
------------------------------------------------------------------------
Interactive Objects Software GmbH | Freiburg | Geschäftsführer: Alberto Perandones, Andrea Hemprich
| AG Frbg. HRB 5810 | USt-ID: DE 197983057
More information about the antlr-interest
mailing list