[antlr-interest] Problem when parsing numerics

Thomas Woelfle thomas.woelfle at interactive-objects.com
Tue Feb 24 23:39:46 PST 2009


Hi Gavin,

thanks for the tip. Using a syntactic predicate works. But to me this 
seems to be a bug in the algorithm that examines the minimal amount of 
lookahread since it calculates a wrong minimal lookahead, isn't it?

Regards,
Thomas

> At 21:54 19/02/2009, Thomas Woelfle wrote:
> >But even when factoring out "." into the DOT lexer rule the result
> >is the same.
> >I think my problem is not the parser but the lexer. As far as I
> >know a lexer usually tokenizes an input stream always trying to
> >find the longest valid token. In my case valid tokens are NUMERICs
> >and DOTs. Where a NUMERIC can be "5", "5.5", "123.56".
> >"5." is not a NUMERIC. Having that input stream I would have
> >expected the lexer to tokenize it into two tokens NUMERIC "5"
> >and DOT ".".
>
> To understand why it's behaving like that, you need to understand the 
> algorithm that ANTLR is currently using for the lexer.
>
> Basically, whenever it wants to generate a token, it calls the mTOKENS 
> method, which is effectively a rule that makes the following assumptions:
>   - all top-level lexer rules are possible alternatives
>   - exactly one token must be matched
>   - if none of the alternatives can be matched, then skip one 
> character and try again
>
> Coupled with this is a performance criterion that has it examine the 
> minimal amount of lookahead it can get away with to unambiguously 
> decide between two rules or alternatives.
>
> The upshot of all this is that when faced with the input "5.", the 
> mTOKENS rule will always select NUMERIC since it's the only rule that 
> allows the input to begin with digits.  Then the NUMERIC rule will 
> consume both the digits and the dot because when faced with the 
> decision whether to match the optional alt ('.' DIGIT+)? or to match 
> nothing, it only uses one character of lookahead (the dot), due to the 
> minimal-lookahead rule.
>
> It's basically seeing it as a choice between a consume-nothing 
> alternative and a consume-dot alternative, so the obvious condition is 
> whether the next character is a dot or not.  It doesn't look "past" 
> that and realise that if the dot isn't followed by a digit then it 
> can't possibly match the full alt successfully.
>
> You can force it to use two characters of lookahead via a syntactic 
> predicate, though:
>   NUMERIC : DIGIT+ (('.' DIGIT) => '.' DIGIT+)? ;
>


-- 
Interactive Objects Software GmbH
Basler Strasse 61
79100 Freiburg, Germany

Phone:  +49 761 400 73 0
mailto:thomas.woelfle at interactive-objects.com


------------------------------------------------------------------------

Interactive Objects' Legacy Modernization Solutions 

Get Your Applications SOA-Ready!

See http://www.interactive-objects.com/ for more information.

------------------------------------------------------------------------


Interactive Objects Software GmbH | Freiburg | Geschäftsführer: Alberto Perandones, Andrea Hemprich
| AG Frbg. HRB 5810 | USt-ID: DE 197983057



More information about the antlr-interest mailing list