[antlr-interest] lexer problem (BUG?)

Ruth Karl ruth.karl at gmx.de
Fri Jul 27 08:18:05 PDT 2007


Thanks, Thomas.
I did try your workaround (with the predicate...), but ANTLR still chokes on 
an input like <s>...
(Same problem: it does not see the '<s' anymore and then falls over the 
sudden closing tag...)
I guess I will try to treat JavaScript in another way, then... (a 
pity... ;-)
Ruth

Thomas Brandon wrote:
> On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>   
>> Hi Andrew,
>>
>> thanks a lot for finding a smaller example to illustrate the problem.
>> (Did you do it for the Java target or for C#, as I did?)
>>
>> Now: what can I do?
>> I could (...) try to find a workaround in my grammar, but if it IS a bug
>> - then a similar thing might happen in other cases as well....
>>
>>     
> It's not a bug, though it may be considered a limitation.
> The problem is that ANTLR's prediction algorithm doesn't look past
> token boundaries, so it makes its predictions based on only a single
> token. As the only possible single-token matches for '<' followed by
> anything are JAVASCRIPT and OPENTAG (talking about your original
> grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
> predicts that it must be JAVASCRIPT, then gives an error when that
> won't match. Looking at the mTokens method ANTLR generates may help
> you see what is going on. The problem is discussed in
> http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
> .
> Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
> backtracking won't help and a predicate in OPENTAG won't be hoisted. A
> fix for your original grammar is to replace the previous rules with:
> fragment
> JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
>                 ;
> OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>                 |       '<'
>                 ;
>
> Ter said he'd investigate the possibility of enhancing the prediction
> algorithm to deal with such cases.
>
> Tom.
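
To illustrate the prediction behaviour Thomas describes above, here is a minimal hand-written Java sketch. It is not the mTokens code that ANTLR generates: the class PredictionSketch and all of its methods are hypothetical names, and the "skip one character" error recovery is an assumption made only to keep the loop short. It models Andrew's two-rule sample (JAVASCRIPT and ANY) and contrasts committing to JAVASCRIPT on seeing '<s' with checking the whole '<script>' literal first, which is roughly what a syntactic predicate asks for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy lexer; a sketch of the idea, not ANTLR's generated code.
final class PredictionSketch {
    static final String OPEN = "<script>";
    static final String CLOSE = "</script>";

    // Committed prediction: decide on the first two characters only.
    static String predictCommitted(String in, int pos) {
        return in.startsWith("<s", pos) ? "JAVASCRIPT" : "ANY";
    }

    // Guarded prediction: require the whole opening literal before committing.
    static String predictGuarded(String in, int pos) {
        return in.startsWith(OPEN, pos) ? "JAVASCRIPT" : "ANY";
    }

    // Match JAVASCRIPT at pos and return the index just past '</script>'.
    // Throws with an ANTLR-style message on the first mismatching character.
    static int matchJavascript(String in, int pos) {
        for (int i = 0; i < OPEN.length(); i++) {
            int p = pos + i;
            if (p >= in.length() || in.charAt(p) != OPEN.charAt(i)) {
                String got = p < in.length() ? "'" + in.charAt(p) + "'" : "<EOF>";
                throw new IllegalStateException(
                    "mismatched character " + got + " expecting '" + OPEN.charAt(i) + "'");
            }
        }
        // Non-greedy body: stop at the first closing tag.
        int end = in.indexOf(CLOSE, pos + OPEN.length());
        if (end < 0) {
            throw new IllegalStateException("missing " + CLOSE);
        }
        return end + CLOSE.length();
    }

    // Tokenize the input, using either the committed or the guarded prediction.
    static List<String> lex(String in, boolean guarded) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            String type = guarded ? predictGuarded(in, pos) : predictCommitted(in, pos);
            if (type.equals("JAVASCRIPT")) {
                try {
                    pos = matchJavascript(in, pos);
                    out.add("JAVASCRIPT");
                } catch (IllegalStateException e) {
                    out.add("ERROR: " + e.getMessage());
                    pos++; // assumed recovery for this sketch: skip one character
                }
            } else {
                out.add("ANY");
                pos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ERROR: mismatched character '>' expecting 'c', ANY, ANY]
        System.out.println(lex("<s>", false));
        // prints [ANY, ANY, ANY]
        System.out.println(lex("<s>", true));
    }
}
```

With the committed prediction, '<s>' produces the same kind of "mismatched character '>' expecting 'c'" error that Andrew reports; with the guarded prediction, the same input falls through to ANY. Real ANTLR error recovery may consume input differently from this sketch.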
>   
>> Thanks for any further suggestions,
>>
>> Ruth
>>
>>
Andrew Lentvorski wrote:
>>     
>>> Ruth Karl wrote:
>>>       
>>>> Thanks, but I looked at it several times (even before I ever wrote to
>>>> this list) and still I cannot see why, when I start an input with
>>>> '<sx', the lexer should lose itself in a rule wanting '<script'
>>>> as input (given the grammar I attached in my last posting).
>>>> Any other suggestions?
>>>>         
>>> Looks like a bug to me:
>>>
>>> grammar jsp;
>>>
>>> JAVASCRIPT    :    '<script>' ( options {greedy=false;} : . )* '</script>' {System.out.print("J");};
>>> ANY    :    . {System.out.print("A");};
>>>
>>> jsp        :    (ANY | JAVASCRIPT)* EOF;
>>>
>>> with input:
>>>
>>> <script>foo</script>
>>> <s>bar</s>
>>>
>>>
>>> Produces a token stream of:
>>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
>>>
>>> aka
>>>
>>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
>>>
>>> Something vacuums up the "<s>b"
>>>
>>> The output is:
>>> line 2:2 mismatched character '>' expecting 'c'
>>> JAAAAAAAA
>>>
>>> You might want to file it and see what the response is.
>>>
>>> -a
>>>
>>>       
>
>   

