[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?

Darach Ennis darach at gmail.com
Fri Feb 15 07:34:55 PST 2008


Hi all.

After some trial and error and a little brain-stretching the following seems
to work:

F:   ('0' | '1'..'9' '0'..'9'*)
    (
        { input.LA(1) == '.' && Character.isDigit(input.LA(2)) }?=>
            ('.' '0'..'9'+) { _type = F; }
        |   { _type = I; }
    )
    ;

So collapsing the common prefix of the Integer and Float lexer rules into a
single rule, and gating the '.' with a predicate that checks for a following
digit, seems to resolve the issue. Of course, I is now an imaginary token,
since it no longer has a lexer rule of its own. I've noticed that this works
only if the rule is a non-fragment rule:

[antlr3] ANTLR Parser Generator  Version 3.0.1 (August 13, 2007)  1989-2007
   [antlr3] warning(105): /tmp/FragmentFloat.g:35:94: no lexer rule
corresponding to token: I
    [javac] Compiling 2 source files to
/play/eclipse/workspace/jerlang/build
    [javac]
/play/eclipse/workspace/jerlang/src/org/pojodyne/jerlang/antlr/testing/TestFloatLexer.java:143:
cannot find symbol
    [javac] symbol  : variable _type
    [javac] location: class
org.pojodyne.jerlang.antlr.testing.TestFloatLexer
    [javac]                      _type = F;
    [javac]                      ^
    [javac]
/play/eclipse/workspace/jerlang/src/org/pojodyne/jerlang/antlr/testing/TestFloatLexer.java:150:
cannot find symbol
    [javac] symbol  : variable _type
    [javac] location: class
org.pojodyne.jerlang.antlr.testing.TestFloatLexer
    [javac]                      _type = I;
    [javac]                      ^
    [javac] 2 errors

Perhaps this is related to the fact that fragment lexer rules do not accept
parameters? The _type field should be defined in lexer fragment rules so that
ambiguity such as the above can be resolved without making a rule public.
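
For anyone who wants to see the gated decision outside of ANTLR, here is a
minimal hand-written Java sketch of the same idea: match the integer part
first, then peek two characters ahead and only commit to a float if a '.' is
immediately followed by a digit. The class and method names (DotLexerSketch,
tokenize) are made up for illustration and are not part of ANTLR or of the
generated lexer; tokens are returned as plain strings to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not ANTLR-generated code.
final class DotLexerSketch {

    /** Splits the input into I(...), F(...) and D tokens; whitespace is skipped. */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '.') { out.add("D"); i++; continue; }
            if (!isDigit(c)) {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
            int start = i;
            // Integer part: '0' | '1'..'9' '0'..'9'*
            if (c == '0') {
                i++;
            } else {
                while (i < s.length() && isDigit(s.charAt(i))) i++;
            }
            // The gate: only take the fraction if '.' is followed by a digit.
            boolean fraction = i + 1 < s.length()
                    && s.charAt(i) == '.'
                    && isDigit(s.charAt(i + 1));
            if (fraction) {
                i++; // consume '.'
                while (i < s.length() && isDigit(s.charAt(i))) i++;
                out.add("F(" + s.substring(start, i) + ")");
            } else {
                out.add("I(" + s.substring(start, i) + ")");
            }
        }
        return out;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        // prints [I(1), F(2.3), D, I(4), I(9), D, I(9), D, D, D]
        System.out.println(tokenize("1 2.3 .4 9. 9..."));
    }
}
```

The point is that the I-versus-F decision is made after the shared prefix has
been consumed, which is what the predicate in the rule above does.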

So in answer to my own question: Lexical recovery is most likely a sign of
an inflexible brain, not an inflexible ANTLR, at least in this case.
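
For completeness, the character-level "insertion" I asked about in the
original message below could probably be approximated as a pre-pass over the
input before it reaches the lexer. This is only a sketch under that
assumption, not something ANTLR provides: the names DotSeparatorSketch and
separateDanglingDots are invented here. It inserts a space before any '.'
that follows a digit but is not itself followed by a digit.

```java
// Hypothetical pre-pass; an assumption for illustration, not an ANTLR feature.
final class DotSeparatorSketch {

    /** Inserts a space before each '.' that follows a digit and is not followed by one. */
    static String separateDanglingDots(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean afterDigit = i > 0 && isDigit(s.charAt(i - 1));
            boolean beforeDigit = i + 1 < s.length() && isDigit(s.charAt(i + 1));
            if (c == '.' && afterDigit && !beforeDigit) {
                out.append(' '); // split the "mismatched float" into <integer> <dot>
            }
            out.append(c);
        }
        return out.toString();
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        // prints 1 2.3 .4 9 . 9 ...
        System.out.println(separateDanglingDots("1 2.3 .4 9. 9..."));
    }
}
```

The obvious drawback is that inserted characters shift the column positions
reported in later error messages, which is one more reason to prefer the
gated rule above.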

Regards,

Darach.

On Fri, Feb 15, 2008 at 2:14 PM, Darach Ennis <darach at gmail.com> wrote:

> Hi all.
>
> I have a small testcase grammar, shown below, which correctly matches
> integers, floats and dots unless a <number><dot><non-number> sequence
> appears in the input stream. In that case it tries to match a float, fails,
> and issues an error because it cannot find a lexical subrule alternative.
>
> Input phrase example:
>
> 1 2.3 .4 9. 9...
>
> Output errors:
> line 1:11 required (...)+ loop did not match anything at character ' ' //
> aka: the character following the occurrence of '9.' above, a space
> line 1:14 required (...)+ loop did not match anything at character '.' //
> aka: the character following the second occurrence of '9.' above, another
> dot
>
> Note that '.4' above is not a float but a dot followed by the integer
> four. This gets matched correctly.
>
> As '.' is used as the terminator of a statement, and a lone integer is a
> valid expression, the token sequence <digit><dot> is valid.
>
> The grammar:
>
> test:   literal+;
> literal:    I | D | F;
> I   :   ('0' | '1'..'9' '0'..'9'*) ;
> F :   ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+;
> D   :   '.';
> WS  :   (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; };
> C   :   '#' ~('\n'|'\r')* ('\r'|'\n') { $channel=HIDDEN; };
>
> Introducing an error rule to mop up the subrule mismatch is about the
> only strategy that seems to work:
>
> test:   literal+;
> literal:    I | D | F | ERR;
> I   :   ('0' | '1'..'9' '0'..'9'*);
> F :   ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+;
> D   :   '.';
> ERR: I D;
> WS  :   (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; };
> C   :   '#' ~('\n'|'\r')* ('\r'|'\n') { $channel=HIDDEN; };
>
> This might be just fine in simpler grammars. However, I'm looking for
> something akin to error recovery by symbol insertion, but at the
> lexer/character level, not in the parser as described in the book. In my
> case, any dot character preceded by a digit but not followed by a digit
> should be preceded by whitespace:
>
> 2.3 -> <number><dot><number>
> .3 -> <dot> <number>
> 3. -> <number> <whitespace> <dot>
>
> Thus in the third production we would avoid the lexical ambiguity simply
> by separating the <integer> and <dot> tokens of the 'mismatched float' with
> an intervening whitespace. However, it looks like manual error recovery in
> lexer rules is not supported by ANTLRv3, at least in the Java grammar.
> Here's a modified (and probably illegal) 'F' lexer rule:
>
> F :   ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+
>     ;
>     catch [RecognitionException re] {
>         // recover?
>     }
>
> ANTLR will try to generate code for this, but the templates needed for
> the error recovery are missing:
>
> ANTLR Parser Generator  Version 3.0.1 (August 13, 2007)  1989-2007
> error(10):  internal error: /tmp/BadDot.g : java.util.NoSuchElementException: no such attribute: exceptions in template context [lexerRule]
> org.antlr.stringtemplate.StringTemplate.rawSetAttribute(StringTemplate.java:661)
> org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:522)
> org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:604)
> org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:565)
> org.antlr.codegen.CodeGenTreeWalker.exceptionHandler(CodeGenTreeWalker.java:1413)
> org.antlr.codegen.CodeGenTreeWalker.exceptionGroup(CodeGenTreeWalker.java:1103)
> org.antlr.codegen.CodeGenTreeWalker.rule(CodeGenTreeWalker.java:805)
> org.antlr.codegen.CodeGenTreeWalker.rules(CodeGenTreeWalker.java:544)
> org.antlr.codegen.CodeGenTreeWalker.grammarSpec(CodeGenTreeWalker.java:486)
> org.antlr.codegen.CodeGenTreeWalker.grammar(CodeGenTreeWalker.java:297)
> org.antlr.codegen.CodeGenerator.genRecognizer(CodeGenerator.java:406)
> org.antlr.Tool.processGrammar(Tool.java:347)
> org.antlr.Tool.process(Tool.java:311)
> org.antlr.Tool.main(Tool.java:70)
>
> Is there a simpler fix to the dot-ambiguity that I'm missing? Would a
> lexical error recovery mechanism
> be justifiably used in this case? Or, is this user error or a
> limitation/bug with ANTLRv3?
>
> Regards,
>
> Darach.
>
>