[antlr-interest] lexer problem (BUG?)

Fri Jul 27 09:52:32 PDT 2007

On 7/28/07, Daniel Brosseau <daniel at lba.ca> wrote:
> How about this very simple example I tried in ANTLRWorks 1.1:
>
> grammar lex;
> fragment KEYWORD  :  'a' 'b' 'c';
> fragment OTHER : 'a'|'b'|'c'|'d';
> TOKEN : (KEYWORD)=> KEYWORD { $type = KEYWORD; }
>                | OTHER { $type = OTHER; };
> token : TOKEN;
> program : token*;
>
Either remove the type setting or try something like:
grammar lex;

fragment KEYWORD  :  'a' 'b' 'c';
fragment OTHER : 'a'|'b'|'c'|'d';
TOKEN : (KEYWORD)=> KEYWORD { $type = KEYWORD; }
              | OTHER { $type = OTHER; };
token : OTHER;
keyword:	KEYWORD;
program : (token|keyword)+ EOF;

This handles everything correctly in the debugger. In the interpreter
it gives a MismatchedTokenException because, without the actions, the
lexer only ever returns TOKEN.

> With input: "abd", the interpreter breaks up the input into 'ab' and 'd'.
>
> Now if the interpreter does not execute predicates then I can see that it
> would not have seen the (KEYWORD) predicate and would have choked after 'ab'
>
> With input "abc", the interpreter breaks up the input into 'ab' and 'c'.
>
> But here it should have eaten up 'abc' regardless... even with input 'abcd'
> it breaks it up into 'ab' 'c' and 'd' and not 'abc' and 'd'.
I'm not sure what the interpreter is doing here. It might be an
interpreter bug, since nothing should match 'ab'. Given no actions, I
thought the interpreter would effectively be running:
grammar lex;

fragment KEYWORD  :  'a' 'b' 'c';
fragment OTHER : 'a'|'b'|'c'|'d';
TOKEN : KEYWORD
      | OTHER;
token : TOKEN;
program : token*;

With this grammar the interpreter output does line up with the debugger.

Tom.
>
> With input "abcd", in the debugger I get
> root
> program
> org.antlr.runtime.EarlyExitException
>
> and the input "abcd" is in a red box in the Input window. The Output window
> had:
> line 1:0 required (...)+ loop did not match anything at input 'abc'
>
> None of this seems right. What am I missing?
>
> Daniel
>
> > On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> > Strange. I am able to correctly parse "<set><script>test</script><s>"
> > after making the modifications I gave. Note the interpreter in
> > ANTLRWorks doesn't execute actions or predicates so it won't work
> > there.
> > The full grammar I used was:
> ...
> >> > On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> >> >
> >> > It's not a bug. Though it may be considered a limitation.
> >> > The problem is that ANTLR's prediction algorithm doesn't look past
> >> > token boundaries, so it makes its predictions based on only a single
> >> > token. As the only possible single-token matches for '<' followed by
> >> > anything are JAVASCRIPT and OPENTAG (talking about your original
> >> > grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
> >> > predicts that it must be JAVASCRIPT, then gives an error when that
> >> > won't match. Looking at the mTokens method ANTLR generates may help
> >> > you see what is going on. The problem is discussed in
> >> > http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
> >> > Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
> >> > backtracking won't help and a predicate in OPENTAG won't be hoisted. A
> >> > fix for your original grammar is to replace the previous rules with:
> >> > fragment
> >> > JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
> >> >        ;
> >> > OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
> >> >                               |               '<'
> >> >        ;
> >> >
> >> > Ter said he'd investigate the possibility of enhancing the prediction
> >> > algorithm to deal with such cases.
> >> >
> >> > Tom.
> >> >
> >> >> Thanks for any further suggestions,
> >> >>
> >> >> Ruth
>
>
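
For anyone following the thread, the difference between committing to a
prediction and guarding an alternative with a syntactic predicate can be
sketched outside ANTLR. The Python below is only an illustration, not
ANTLR's generated code: the function names are hypothetical, and the
two-character lookahead in tokenize_committed is an assumption chosen to
mimic a lexer that predicts KEYWORD from 'ab' and then cannot back out.

```python
KEYWORD = "abc"      # fragment KEYWORD : 'a' 'b' 'c';
OTHER = set("abcd")  # fragment OTHER : 'a'|'b'|'c'|'d';


def tokenize_committed(text):
    """Pick an alternative from a fixed lookahead, then commit to it."""
    tokens, pos = [], 0
    while pos < len(text):
        if text.startswith("ab", pos):
            # Assumed prediction: seeing 'a' 'b' selects KEYWORD, no way back.
            if not text.startswith(KEYWORD, pos):
                raise ValueError(f"mismatched input at index {pos + 2}")
            tokens.append(("KEYWORD", KEYWORD))
            pos += len(KEYWORD)
        elif text[pos] in OTHER:
            tokens.append(("OTHER", text[pos]))
            pos += 1
        else:
            raise ValueError(f"no viable alternative at index {pos}")
    return tokens


def tokenize_with_predicate(text):
    """Check the whole KEYWORD first, as (KEYWORD)=> would, else use OTHER."""
    tokens, pos = [], 0
    while pos < len(text):
        if text.startswith(KEYWORD, pos):
            # The speculative match succeeded, so emit KEYWORD.
            tokens.append(("KEYWORD", KEYWORD))
            pos += len(KEYWORD)
        elif text[pos] in OTHER:
            tokens.append(("OTHER", text[pos]))
            pos += 1
        else:
            raise ValueError(f"no viable alternative at index {pos}")
    return tokens
```

Under these assumptions, tokenize_with_predicate("abd") falls back to three
OTHER tokens, while tokenize_committed("abd") raises because it has already
chosen KEYWORD by the time it reaches 'd'. Both agree on "abcd".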