[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?

Fri Feb 15 07:43:11 PST 2008

Hi all.

Last, but not least: this 'fix' only works if the now-imaginary 'I' token
is *not* declared in the tokens section.
Declaring the token in the tokens section will 'break' the 'fix'.
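
For anyone who wants to see the gating logic of the rule quoted below
outside of ANTLR, here is a minimal hand-written Java sketch. It is not
ANTLR-generated code: the class name, the la() helper (a stand-in for
input.LA(k)) and the "I:"/"F:" labels are made up for illustration, and
the comment rule is left out. It only shows the idea of consuming '.'
as part of a float when the character after it is a digit.

```java
import java.util.ArrayList;
import java.util.List;

final class DotGateSketch {
    // Hypothetical stand-in for input.LA(k): char at index i, or '\0' past the end.
    static char la(String s, int i) {
        return i < s.length() ? s.charAt(i) : '\0';
    }

    // Mirrors the grammar's '0'..'9' range (ASCII digits only).
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p < s.length()) {
            char c = s.charAt(p);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                p++;                       // WS: skipped (hidden channel in the grammar)
            } else if (c == '.') {
                out.add("D");              // D: a lone dot
                p++;
            } else if (isDigit(c)) {
                int start = p;
                if (c == '0') {
                    p++;                   // '0'
                } else {
                    while (isDigit(la(s, p))) p++;   // '1'..'9' '0'..'9'*
                }
                String type = "I";         // default alternative: integer
                // The gate: take the '.' only if a digit follows it.
                if (la(s, p) == '.' && isDigit(la(s, p + 1))) {
                    p++;                   // consume '.'
                    while (isDigit(la(s, p))) p++;   // '0'..'9'+
                    type = "F";
                }
                out.add(type + ":" + s.substring(start, p));
            } else {
                throw new IllegalArgumentException("unexpected character at " + p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [I:1, F:2.3, D, I:4, I:9, D, I:9, D, D, D]
        System.out.println(tokenize("1 2.3 .4 9. 9..."));
    }
}
```

With the input phrase from the original question, '9.' comes out as an
integer followed by a dot instead of a failed float.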

Regards,

Darach.

On Fri, Feb 15, 2008 at 3:34 PM, Darach Ennis <darach at gmail.com> wrote:

> Hi all.
>
> After some trial and error and a little brain-stretching the following
> seems to work:
>
> F:   ('0' | '1'..'9' '0'..'9'*)
>     (
>         { input.LA(1) == '.' && Character.isDigit(input.LA(2)) }?=> ('.' '0'..'9'+) { _type = F; }
>         |   { _type = I; }
>     )
>     ;
>
> So collapsing the common fragments of the Integer and Float lexer rules
> into a common rule and gating the '.' appropriately seems to resolve the
> issue. Of course, now I is imaginary... I've noticed that this works only
> if the rule is a non-fragment rule:
>
> [antlr3] ANTLR Parser Generator  Version 3.0.1 (August 13, 2007) 1989-2007
>    [antlr3] warning(105): /tmp/FragmentFloat.g:35:94: no lexer rule corresponding to token: I
>     [javac] Compiling 2 source files to /play/eclipse/workspace/jerlang/build
>     [javac] /play/eclipse/workspace/jerlang/src/org/pojodyne/jerlang/antlr/testing/TestFloatLexer.java:143: cannot find symbol
>     [javac] symbol  : variable _type
>     [javac] location: class org.pojodyne.jerlang.antlr.testing.TestFloatLexer
>     [javac]                      _type = F;
>     [javac]                      ^
>     [javac] /play/eclipse/workspace/jerlang/src/org/pojodyne/jerlang/antlr/testing/TestFloatLexer.java:150: cannot find symbol
>     [javac] symbol  : variable _type
>     [javac] location: class org.pojodyne.jerlang.antlr.testing.TestFloatLexer
>     [javac]                      _type = I;
>     [javac]                      ^
>     [javac] 2 errors
>
> Perhaps this is related to the fact that fragment lexer rules do not
> accept parameters? The _type field should be defined in lexer fragment
> rules so that ambiguity such as the above can be resolved without making
> a rule public.
>
> So in answer to my own question: Lexical recovery is most likely a sign of
> an inflexible brain, not an inflexible ANTLR, at least in this case.
>
> Regards,
>
> Darach.
>
>
> On Fri, Feb 15, 2008 at 2:14 PM, Darach Ennis <darach at gmail.com> wrote:
>
> > Hi all.
> >
> > I have a small testcase grammar (below) which correctly matches
> > integers, floats and dots unless a <number><dot><non-number> sequence
> > is in the input stream. In that case it tries to match a float, fails,
> > and issues an error because it cannot find a lexical subrule
> > alternative.
> >
> > Input phrase example:
> >
> > 1 2.3 .4 9. 9...
> >
> > Output errors:
> > line 1:11 required (...)+ loop did not match anything at character ' '
> > // aka: the character following the first occurrence of '9.' above, a space
> > line 1:14 required (...)+ loop did not match anything at character '.'
> > // aka: the character following the second occurrence of '9.' above, another dot
> >
> > Note that '.4' above is not a float but a dot followed by the integer
> > four. This gets matched correctly.
> >
> > As '.' is used as a statement terminator and a lone integer is a
> > valid expression, the token sequence <digit><dot> is valid.
> >
> > The grammar:
> >
> > test:   literal+;
> > literal:    I | D | F;
> > I   :   ('0' | '1'..'9' '0'..'9'*) ;
> > F :   ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+;
> > D   :   '.';
> > WS  :   (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; };
> > C   :   '#' ~('\n'|'\r')* ('\r'|'\n') { $channel=HIDDEN; };
> >
> > Introducing an error rule to mop up the subrule mismatch is about
> > the only strategy that seems to work:
> >
> > test:   literal+;
> > literal:    I | D | F | ERR;
> > I   :   ('0' | '1'..'9' '0'..'9'*);
> > F :   ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+;
> > D   :   '.';
> > ERR: I D;
> > WS  :   (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; };
> > C   :   '#' ~('\n'|'\r')* ('\r'|'\n') { $channel=HIDDEN; };
> >
> > This might be just fine in simpler grammars. However, I'm looking for
> > something akin to error recovery by symbol insertion, but at the
> > lexer/character level, not in the parser as described in the book. In
> > my case any dot character preceded by a digit but not followed by a
> > digit should be preceded by a whitespace:
> >
> > 2.3 -> <number><dot><number>
> > .3 -> <dot> <number>
> > 3. -> <number> <whitespace> <dot>
> >
> > Thus in the third production we would avoid the lexical ambiguity
> > simply by separating the 'mismatched float' <integer> and <dot> tokens
> > by an intervening whitespace. However, it looks like manual error
> > recovery in lexer rules is not supported by ANTLRv3, at least in the
> > Java grammar. Here's a modified (and probably illegal) 'F' lexer rule:
> >
> > F :   ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+
> >     ;
> >     catch [RecognitionException re] {
> >         // recover?
> >     }
> >
> > ANTLR will try to generate code for this, but there are missing
> > templates for the error recovery:
> >
> > ANTLR Parser Generator  Version 3.0.1 (August 13, 2007)  1989-2007
> > error(10):  internal error: /tmp/BadDot.g : java.util.NoSuchElementException: no such attribute: exceptions in template context [lexerRule]
> > org.antlr.stringtemplate.StringTemplate.rawSetAttribute(StringTemplate.java:661)
> > org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:522)
> > org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:604)
> > org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:565)
> > org.antlr.codegen.CodeGenTreeWalker.exceptionHandler(CodeGenTreeWalker.java:1413)
> > org.antlr.codegen.CodeGenTreeWalker.exceptionGroup(CodeGenTreeWalker.java:1103)
> > org.antlr.codegen.CodeGenTreeWalker.rule(CodeGenTreeWalker.java:805)
> > org.antlr.codegen.CodeGenTreeWalker.rules(CodeGenTreeWalker.java:544)
> > org.antlr.codegen.CodeGenTreeWalker.grammarSpec(CodeGenTreeWalker.java:486)
> > org.antlr.codegen.CodeGenTreeWalker.grammar(CodeGenTreeWalker.java:297)
> > org.antlr.codegen.CodeGenerator.genRecognizer(CodeGenerator.java:406)
> > org.antlr.Tool.processGrammar(Tool.java:347)
> > org.antlr.Tool.process(Tool.java:311)
> > org.antlr.Tool.main(Tool.java:70)
> >
> > Is there a simpler fix to the dot-ambiguity that I'm missing? Would a
> > lexical error recovery mechanism
> > be justifiably used in this case? Or, is this user error or a
> > limitation/bug with ANTLRv3?
> >
> > Regards,
> >
> > Darach.
> >
> >
>
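The whitespace-insertion idea from the original question (turning '3.'
into '3 .' before the lexer sees it) could also be tried as a plain
preprocessing pass over the input text. The following is only a sketch
under that assumption: the class and method names are hypothetical, it is
not an ANTLR facility, and it shifts column positions in any later error
messages because it changes the input length.

```java
final class DotSpacerSketch {
    // Mirrors the grammar's '0'..'9' range (ASCII digits only).
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Inserts a space before every '.' that directly follows a digit
    // and is not directly followed by a digit.
    static String separateTrailingDots(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean afterDigit = i > 0 && isDigit(s.charAt(i - 1));
            boolean beforeDigit = i + 1 < s.length() && isDigit(s.charAt(i + 1));
            if (c == '.' && afterDigit && !beforeDigit) {
                out.append(' ');   // split the 'mismatched float' into <integer> <dot>
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints 1 2.3 .4 9 . 9 ...
        System.out.println(separateTrailingDots("1 2.3 .4 9. 9..."));
    }
}
```

The gated lexer rule avoids the extra pass and keeps positions intact,
so this is mostly useful for comparing the two approaches.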