[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?

Gavin Lambert antlr at mirality.co.nz
Fri Feb 15 11:42:46 PST 2008


At 04:34 16/02/2008, Darach Ennis wrote:
>After some trial and error and a little brain-stretching the 
>following seems to work:
>
>F:   ('0' | '1'..'9' '0'..'9'*)
>     (
>         { input.LA(1) == '.' && Character.isDigit(input.LA(2)) }?=>
>             ('.' '0'..'9'+) { _type = F; }
>         |   { _type = I; }
>     )
>     ;

First: don't use _type (that's an implementation detail).  Use 
$type instead.
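
For example, the rule quoted above would then read roughly as 
follows.  This is only a sketch: it assumes I is declared somewhere 
else in the grammar (for instance as a dummy fragment rule) so that 
the token type exists.

```antlr
F:   ('0' | '1'..'9' '0'..'9'*)
     (
         { input.LA(1) == '.' && Character.isDigit(input.LA(2)) }?=>
             ('.' '0'..'9'+) { $type = F; }   // digit after the dot: keep type F
         |   { $type = I; }                   // otherwise retype as I (assumed declared elsewhere)
     )
     ;
```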

Second: solutions to this issue have been posted several times 
before; a common alternative solution is:

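// DOT is not defined in this snippet; it is assumed to exist
// elsewhere in the grammar (e.g. DOT: '.';).
fragment DIGIT: '0'..'9';
fragment NUMBER: DIGIT+;
// Not called by any rule here; it exists so the FLOAT token type is defined.
fragment FLOAT: NUMBER DOT NUMBER;
INT
   :  NUMBER
      // Only consume the dot when a digit follows it, then retype the token.
      ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
   ;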

(Or you could replace that first NUMBER in the INT rule with ('0' 
| '1'..'9' DIGIT*) if you wanted to ensure leading zeros were 
invalid.)
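
As a sketch of that variant (the fragment rules stay the same, and 
DOT is still assumed to be defined elsewhere), the INT rule would 
become:

```antlr
INT
   :  ('0' | '1'..'9' DIGIT*)   // a lone zero, or a number that does not start with zero
      ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
   ;
```

Note that with this rule an input such as 007 would most likely 
come out as three separate INT tokens rather than as a lexer error, 
so actually rejecting it would be left to the parser.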

The actual contents of the FLOAT rule don't matter, though it's 
usually preferable to make it look similar to what it's going to 
represent.

FLOAT can actually be put into the tokens section instead, but 
only if it has no content (if it has content it becomes a 
top-level rule, which isn't the goal).  Unfortunately, doing this 
causes ANTLR to emit a warning at present, which is why the dummy 
fragment approach is usually preferred.
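
For comparison, a sketch of the tokens-section variant, assuming 
ANTLR v3 syntax and the same DIGIT, NUMBER and DOT definitions as 
before (expect the warning mentioned above when generating):

```antlr
tokens {
    FLOAT;   // declared with no content, so it does not become a top-level rule
}

fragment DIGIT: '0'..'9';
fragment NUMBER: DIGIT+;
INT
   :  NUMBER
      ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
   ;
```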

>The _type field should be defined in lexer fragment rules so that 
>ambiguity such as the above can be resolved without making a rule 
>public.

Lexer fragment rules never emit tokens, so $type is completely 
meaningless for them.  Any type-juggling must be done in the 
top-level rule.


