[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?
Darach Ennis
darach at gmail.com
Fri Feb 15 07:34:55 PST 2008
Hi all.
After some trial and error and a little brain-stretching the following seems
to work:
F: ('0' | '1'..'9' '0'..'9'*)
   (
     { input.LA(1) == '.' && Character.isDigit(input.LA(2)) }?=> ('.' '0'..'9'+) { _type = F; }
   | { _type = I; }
   )
 ;
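The gated predicate in the rule above boils down to a two-character lookahead: commit to a float only when the '.' is immediately followed by a digit, otherwise stop and emit an integer. Below is a minimal hand-written sketch of that same decision in plain Java, with no ANTLR dependency. The class name, the tokenize method and the "I:"/"F:"/"D" labels are made up for illustration and are not part of the generated lexer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer; illustrates the lookahead gate only.
class DotGateSketch {

    // Splits the input into integers (I), floats (F) and dots (D), skipping whitespace.
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '.') { out.add("D"); i++; continue; }
            if (!Character.isDigit(c)) {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
            int start = i;
            // Common prefix of I and F: '0' | '1'..'9' '0'..'9'*
            if (c == '0') {
                i++;
            } else {
                while (i < in.length() && Character.isDigit(in.charAt(i))) i++;
            }
            // The gate: only take the float branch if '.' is followed by a digit.
            boolean isFloat = i + 1 < in.length()
                    && in.charAt(i) == '.'
                    && Character.isDigit(in.charAt(i + 1));
            if (isFloat) {
                i++; // consume '.'
                while (i < in.length() && Character.isDigit(in.charAt(i))) i++;
            }
            out.add((isFloat ? "F:" : "I:") + in.substring(start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Input phrase from the original message quoted below.
        System.out.println(tokenize("1 2.3 .4 9. 9..."));
    }
}
```

With the input phrase from the original message, '9.' comes out as an integer followed by a dot, with no error reported.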
So collapsing the common parts of the Integer and Float lexer rules into
a single rule and gating the '.' appropriately seems to resolve the issue.
Of course, I is now an imaginary token: it has no lexer rule of its own and
is only ever assigned from within F. I've also noticed that this works only
if the rule is a non-fragment rule. Declared as a fragment, it fails:
[antlr3] ANTLR Parser Generator Version 3.0.1 (August 13, 2007) 1989-2007
[antlr3] warning(105): /tmp/FragmentFloat.g:35:94: no lexer rule corresponding to token: I
[javac] Compiling 2 source files to /play/eclipse/workspace/jerlang/build
[javac] /play/eclipse/workspace/jerlang/src/org/pojodyne/jerlang/antlr/testing/TestFloatLexer.java:143: cannot find symbol
[javac] symbol : variable _type
[javac] location: class org.pojodyne.jerlang.antlr.testing.TestFloatLexer
[javac] _type = F;
[javac] ^
[javac] /play/eclipse/workspace/jerlang/src/org/pojodyne/jerlang/antlr/testing/TestFloatLexer.java:150: cannot find symbol
[javac] symbol : variable _type
[javac] location: class org.pojodyne.jerlang.antlr.testing.TestFloatLexer
[javac] _type = I;
[javac] ^
[javac] 2 errors
Perhaps this is related to the fact that fragment lexer rules do not accept
parameters? The _type field should be defined in lexer fragment rules so that
ambiguity such as the above can be resolved without making a rule public.
So in answer to my own question: Lexical recovery is most likely a sign of
an inflexible brain, not an inflexible ANTLR, at least in this case.
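For anyone who still wants the whitespace-insertion behaviour described in the original message below, one option is a pre-pass over the character stream before it reaches the lexer, rather than recovery inside a lexer rule. This is a minimal sketch in plain Java, assuming the whole input is available as a String; the class and method names are hypothetical and nothing here is ANTLR API.

```java
// Hypothetical pre-pass: not part of ANTLR, just a plain string rewrite.
class DotSpacer {

    // Inserts a space before any '.' that is preceded by a digit
    // but not followed by a digit, so "9." becomes "9 .".
    static String separateTrailingDots(String in) {
        StringBuilder out = new StringBuilder(in.length() + 8);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            boolean afterDigit = i > 0 && Character.isDigit(in.charAt(i - 1));
            boolean beforeDigit = i + 1 < in.length()
                    && Character.isDigit(in.charAt(i + 1));
            if (c == '.' && afterDigit && !beforeDigit) {
                out.append(' '); // the "inserted symbol"
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(separateTrailingDots("1 2.3 .4 9. 9..."));
    }
}
```

The obvious cost is that the inserted spaces shift column positions, so any line:column reported by the lexer afterwards no longer matches the original text. The pass also rewrites text inside '#' comments, which is harmless only because those go to the hidden channel. The gated predicate avoids both problems.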
Regards,
Darach.
On Fri, Feb 15, 2008 at 2:14 PM, Darach Ennis <darach at gmail.com> wrote:
> Hi all.
>
> I have a small testcase grammar (below) which correctly matches integers,
> floats and dots unless a <number><dot><non-number> sequence appears in the
> input stream. In that case it tries to match a float, fails, and issues an
> error because no lexical subrule alternative matches.
>
> Input phrase example:
>
> 1 2.3 .4 9. 9...
>
> Output errors:
> line 1:11 required (...)+ loop did not match anything at character ' '
>   // i.e. the character following the first occurrence of '9.' above, a space
> line 1:14 required (...)+ loop did not match anything at character '.'
>   // i.e. the character following the second occurrence of '9.' above, another dot
>
> Note that '.4' above is not a float but a dot followed by the integer
> four. This gets matched correctly.
>
> As '.' is used as a statement terminator and a lone integer is a valid
> expression, the token sequence <digit><dot> is valid.
>
> The grammar:
>
> test: literal+;
> literal: I | D | F;
> I : ('0' | '1'..'9' '0'..'9'*) ;
> F : ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+;
> D : '.';
> WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; };
> C : '#' ~('\n'|'\r')* ('\r'|'\n') { $channel=HIDDEN; };
>
> Introducing an erroneous rule to mop up the subrule mismatch is about the
> only strategy that seems to work:
>
> test: literal+;
> literal: I | D | F | ERR;
> I : ('0' | '1'..'9' '0'..'9'*);
> F : ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+;
> D : '.';
> ERR: I D;
> WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; };
> C : '#' ~('\n'|'\r')* ('\r'|'\n') { $channel=HIDDEN; };
>
> This might be just fine in simpler grammars. However, I'm looking for
> something akin to error recovery by symbol insertion, but at the
> lexer/character level, not in the parser as described in the book. In my
> case, any dot character preceded by a digit but not followed by a digit
> should be preceded by a whitespace:
>
> 2.3 -> <number><dot><number>
> .3  -> <dot> <number>
> 3.  -> <number> <whitespace> <dot>
>
> Thus in the third production we would avoid the lexical ambiguity simply
> by separating the <integer> and <dot> tokens of the 'mismatched float'
> with an intervening whitespace. However, it looks like manual error
> recovery in lexer rules is not supported by ANTLRv3, at least for the
> Java target. Here's a modified (and probably illegal) 'F' lexer rule:
>
> F : ('0' | '1'..'9' '0'..'9'*) '.' '0'..'9'+
> ;
> catch [RecognitionException re] {
> // recover?
> }
>
> ANTLR will try and generate code for this, but there are missing templates
> for the error recovery:
>
> ANTLR Parser Generator Version 3.0.1 (August 13, 2007) 1989-2007
> error(10): internal error: /tmp/BadDot.g : java.util.NoSuchElementException: no such attribute: exceptions in template context [lexerRule]
> org.antlr.stringtemplate.StringTemplate.rawSetAttribute(StringTemplate.java:661)
> org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:522)
> org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:604)
> org.antlr.stringtemplate.StringTemplate.setAttribute(StringTemplate.java:565)
> org.antlr.codegen.CodeGenTreeWalker.exceptionHandler(CodeGenTreeWalker.java:1413)
> org.antlr.codegen.CodeGenTreeWalker.exceptionGroup(CodeGenTreeWalker.java:1103)
> org.antlr.codegen.CodeGenTreeWalker.rule(CodeGenTreeWalker.java:805)
> org.antlr.codegen.CodeGenTreeWalker.rules(CodeGenTreeWalker.java:544)
> org.antlr.codegen.CodeGenTreeWalker.grammarSpec(CodeGenTreeWalker.java:486)
> org.antlr.codegen.CodeGenTreeWalker.grammar(CodeGenTreeWalker.java:297)
> org.antlr.codegen.CodeGenerator.genRecognizer(CodeGenerator.java:406)
> org.antlr.Tool.processGrammar(Tool.java:347)
> org.antlr.Tool.process(Tool.java:311)
> org.antlr.Tool.main(Tool.java:70)
>
> Is there a simpler fix to the dot-ambiguity that I'm missing? Would a
> lexical error recovery mechanism
> be justifiably used in this case? Or, is this user error or a
> limitation/bug with ANTLRv3?
>
> Regards,
>
> Darach.
>
>