[antlr-interest] lexer problem (BUG?)

Thomas Brandon tbrandonau at gmail.com
Fri Jul 27 07:29:13 PDT 2007


On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> Hi Andrew,
>
> thanks a lot for finding a smaller example to illustrate the problem.
> (Did you do it for java target or for c# - as I did?)
>
> Now: what can I do?
> I could (...) try to find a workaround in my grammar, but if it IS a bug
> - then a similar thing might happen in other cases as well....
>
It's not a bug, though it may be considered a limitation.
The problem is that ANTLR's prediction algorithm doesn't look past
token boundaries, so it makes its predictions based on only a single
token. As the only possible single-token matches for '<' followed by
anything are JAVASCRIPT and OPENTAG (talking about your original
grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
predicts that it must be JAVASCRIPT, then gives an error when that
won't match. Looking at the mTokens method ANTLR generates may help
you see what is going on. The problem is discussed in
http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
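
To make the single-token prediction concrete, here is a rough Java
sketch of the kind of decision described above. It is not the code
ANTLR generates: the class and method names (PredictionSketch, predict,
matchLiteral) are made up for illustration, and it assumes the only
rules starting with '<' are JAVASCRIPT ('<script' ...) and OPENTAG ('<').

```java
class PredictionSketch {
    // Hypothetical stand-in for the token-type decision; NOT actual ANTLR output.
    // It only inspects enough characters to tell the single-token
    // alternatives apart, then commits to one of them.
    static String predict(String in, int pos) {
        char la1 = pos < in.length() ? in.charAt(pos) : '\0';
        char la2 = pos + 1 < in.length() ? in.charAt(pos + 1) : '\0';
        // The only single token that can start with "<s" is JAVASCRIPT.
        if (la1 == '<' && la2 == 's') return "JAVASCRIPT";
        if (la1 == '<') return "OPENTAG";
        return "OTHER";
    }

    // Matches a literal at pos; returns null on success or a message
    // describing the first mismatching character.
    static String matchLiteral(String in, int pos, String literal) {
        for (int i = 0; i < literal.length(); i++) {
            char expected = literal.charAt(i);
            if (pos + i >= in.length()) {
                return "unexpected end of input expecting '" + expected + "'";
            }
            char actual = in.charAt(pos + i);
            if (actual != expected) {
                return "mismatched character '" + actual + "' expecting '" + expected + "'";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String input = "<s>bar</s>";
        String type = predict(input, 0);
        System.out.println("predicted: " + type);
        if (type.equals("JAVASCRIPT")) {
            // The lexer has already committed, so the mismatch is an error
            // rather than a cue to try OPENTAG instead.
            System.out.println("error: " + matchLiteral(input, 0, "<script"));
        }
    }
}
```

Once the decision has been made on '<s', nothing sends the lexer back
to try '<' on its own, which is why the mismatch surfaces as an error.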
Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
backtracking won't help and a predicate in OPENTAG won't be hoisted. A
fix for your original grammar is to replace the previous rules with:
fragment
JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
                ;
OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
                |       '<'
                ;
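
The effect of the syntactic predicate in OPENTAG can be sketched in
plain Java as follows. This is only an approximation of the behaviour,
not generated lexer code: PredicateSketch and nextToken are hypothetical
names, the non-greedy loop is modelled with indexOf, and any character
other than '<' is lumped into a made-up OTHER type.

```java
class PredicateSketch {
    // Returns {tokenType, tokenText} for the token starting at pos.
    // Hypothetical helper for illustration; not part of any ANTLR API.
    static String[] nextToken(String in, int pos) {
        if (pos >= in.length()) {
            return new String[] {"EOF", ""};
        }
        // Syntactic predicate: look ahead for the whole "<script>" literal
        // BEFORE committing to the JAVASCRIPT alternative.
        if (in.startsWith("<script>", pos)) {
            // Non-greedy ( . )* : stop at the first "</script>" after "<script".
            int close = in.indexOf("</script>", pos + "<script".length());
            if (close < 0) {
                throw new IllegalStateException("no closing </script>");
            }
            int end = close + "</script>".length();
            return new String[] {"JAVASCRIPT", in.substring(pos, end)};
        }
        // Predicate failed, so fall through to the second alternative.
        if (in.charAt(pos) == '<') {
            return new String[] {"OPENTAG", "<"};
        }
        return new String[] {"OTHER", in.substring(pos, pos + 1)};
    }

    public static void main(String[] args) {
        for (String input : new String[] {"<script>foo</script>", "<s>bar</s>"}) {
            String[] tok = nextToken(input, 0);
            System.out.println(tok[0] + " \"" + tok[1] + "\"");
        }
    }
}
```

The point of the design is that the check for '<script>' happens before
the lexer commits, so '<s>' falls through to the plain '<' alternative
instead of failing inside JAVASCRIPT.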

Ter said he'd investigate the possibility of enhancing the prediction
algorithm to deal with such cases.

Tom.
> Thanks for any further suggestions,
>
> Ruth
>
>
> Andrew Lentvorski schrieb:
> > Ruth Karl wrote:
> >> Thanks, but I looked at it several times (even before I ever wrote to
> >> this list) and still I can not see why when I start an input with
> >> '<sx' the lexer should lose itself in a rule wanting '<script'
> >> as an input. (given the grammar I attached in my last posting).
> >> Any other suggestions?
> >
> > Looks like a bug to me:
> >
> > grammar jsp;
> >
> > JAVASCRIPT    :    '<script>' ( options {greedy=false;} : . )*
> > '</script>' {System.out.print("J");};
> > ANY    :    . {System.out.print("A");};
> >
> > jsp        :    (ANY | JAVASCRIPT)* EOF;
> >
> > with input:
> >
> > <script>foo</script>
> > <s>bar</s>
> >
> >
> > Produces a token stream of:
> > "<script>foo</script>", "a", "r", "<", "/", "s", ">"
> >
> > aka
> >
> > JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
> >
> > Something vacuums up the "<s>b"
> >
> > The output is:
> > line 2:2 mismatched character '>' expecting 'c'
> > JAAAAAAAA
> >
> > You might want to file it and see what the response is.
> >
> > -a
> >
>

