[antlr-interest] lexer problem (BUG?)

Daniel Brosseau daniel at lba.ca
Fri Jul 27 12:27:18 PDT 2007


----- Original Message ----- 
From: "Thomas Brandon" <tbrandonau at gmail.com>
To: "ANTLR-Interest" <antlr-interest at antlr.org>
Sent: Friday, July 27, 2007 12:52 PM
Subject: Re: [antlr-interest] lexer problem (BUG?)


> On 7/28/07, Daniel Brosseau <daniel at lba.ca> wrote:
>> How about this very simple example I tried in ANTLRWorks 1.1:
>>
>> grammar lex;
>> fragment KEYWORD  :  'a' 'b' 'c';
>> fragment OTHER : 'a'|'b'|'c'|'d';
>> TOKEN : (KEYWORD)=> KEYWORD { $type = KEYWORD; }
>>                | OTHER { $type = OTHER; };
>> token : TOKEN;
>> program : token*;
>>
> Either remove the type setting or try something like:
> grammar lex;
>
> fragment KEYWORD  :  'a' 'b' 'c';
> fragment OTHER : 'a'|'b'|'c'|'d';
> TOKEN : (KEYWORD)=> KEYWORD { $type = KEYWORD; }
>              | OTHER { $type = OTHER; };
> token : OTHER;
> keyword: KEYWORD;
> program : (token|keyword)+ EOF;
>
> Which will correctly handle everything in the debugger and give a
> MismatchedTokenException in the interpreter as without the actions
> it's returning only TOKEN.
>

I tried:

grammar lex;
fragment KEYWORD  :  'a' 'b' 'c';
fragment OTHER : 'a'|'b'|'c'|'d';
TOKEN : (KEYWORD)=> KEYWORD
              | OTHER;
token : TOKEN;
program : token* EOF;

With no actions in the LEXER and an added EOF in the PARSER, the debugger 
broke up each of 'abc', 'abcd' and 'abd' properly :-)
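
For readers without ANTLRWorks at hand, here is a rough Python model of what the TOKEN rule above asks for. It is only a hand-written sketch, not the code ANTLR generates: the function name tokenize and the alternative labels are made up for illustration, and it assumes "properly" means 'abc' is taken as one token whenever the (KEYWORD)=> predicate succeeds, with single characters otherwise. In the grammar without actions every token would actually carry the type TOKEN.

```python
def tokenize(text):
    """Illustrative model of: TOKEN : (KEYWORD)=> KEYWORD | OTHER ;"""
    tokens = []
    pos = 0
    while pos < len(text):
        if text.startswith("abc", pos):
            # The syntactic predicate (KEYWORD)=> succeeds: take all of 'abc'.
            tokens.append(("KEYWORD", "abc"))
            pos += 3
        elif text[pos] in "abcd":
            # Predicate failed: fall back to OTHER, which is one character.
            tokens.append(("OTHER", text[pos]))
            pos += 1
        else:
            raise ValueError(f"no viable alternative at {pos}: {text[pos]!r}")
    return tokens


if __name__ == "__main__":
    for sample in ("abc", "abcd", "abd"):
        print(sample, [chunk for _, chunk in tokenize(sample)])
```

Under those assumptions 'abcd' splits into 'abc' and 'd', while 'abd' falls back to three single-character tokens, since the predicate never succeeds on it.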

>> With input: "abd", the interpreter breaks up the input into 'ab' and 'd'.
>>
>> Now if the interpreter does not execute predicates then I can see that it
>> would not have seen the (KEYWORD) predicate and would have choked after 'ab'
>>
>> With input "abc", the interpreter breaks up the input into 'ab' and 'c'.
>>
>> But here it should have eaten up 'abc' regardless... even with input 'abcd'
>> it breaks it up into 'ab' 'c' and 'd' and not 'abc' and 'd'.
> Not sure what the interpreter's doing here. Might be an interpreter
> bug. Looks like nothing should match 'ab'. Given no actions I thought
> the interpreter should be running:
> grammar lex;
>

As you indicated, removing the predicate (KEYWORD) gets the interpreter to 
work OK :-) but it will evidently still choke on 'abd'.

Thanks, this clears up several issues for me. Wonderful program.

Daniel

> fragment KEYWORD  :  'a' 'b' 'c';
> fragment OTHER : 'a'|'b'|'c'|'d';
> TOKEN : KEYWORD
>      | OTHER;
> token : TOKEN;
> program : token*;
>
> Using this the interpreter output does line up with the debugger.
>
> Tom.
>>
>> With input "abcd", in the debugger I get
>> root
>> program
>> org.antlr.runtime.EarlyExitException
>>
>> and the input "abcd" is in a red box in the Input window. The Output window
>> had:
>> line 1:0 required (...)+ loop did not match anything at input 'abc'
>>
>> None of this seems right. What am I missing?
>>
>> Daniel
>>
>> > On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>> > Strange. I am able to correctly parse "<set><script>test</script><s>"
>> > after making the modifications I gave. Note the interpreter in
>> > ANTLRWorks doesn't execute actions or predicates so it won't work
>> > there.
>> > The full grammar I used was:
>> ...
>> >> > On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>> >> >
>> >> > It's not a bug, though it may be considered a limitation.
>> >> > The problem is that ANTLR's prediction algorithm doesn't look past
>> >> > token boundaries, so it makes its predictions based on only a single
>> >> > token. As the only possible single token matches for '<' followed by
>> >> > anything are JAVASCRIPT and OPENTAG (talking about your original
>> >> > grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
>> >> > predicts that it must be JAVASCRIPT, then gives an error when that
>> >> > won't match. Looking at the mTokens method ANTLR generates may help
>> >> > you see what is going on. The problem is discussed in
>> >> > http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
>> >> > .
>> >> > Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
>> >> > backtracking won't help and a predicate in OPENTAG won't be hoisted.
>> >> > A fix for your original grammar is to replace the previous rules with:
>> >> > fragment
>> >> > JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )*
>> >> > '</script>'
>> >> >        ;
>> >> > OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>> >> >                               |               '<'
>> >> >        ;
>> >> >
>> >> > Ter said he'd investigate the possibility of enhancing the prediction
>> >> > algorithm to deal with such cases.
>> >> >
>> >> > Tom.
>> >> >
>> >> >> Thanks for any further suggestions,
>> >> >>
>> >> >> Ruth
>>
>> 
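
The limitation Tom describes in the quoted message (prediction commits to JAVASCRIPT as soon as it sees '<s') can be illustrated with a small Python sketch. This is a simplified model, not ANTLR's generated mTokens code: the function names are hypothetical, the '<s' commit point is taken from Tom's description rather than from the real prediction DFA, and the non-greedy loop is approximated with str.find.

```python
def match_javascript(text, pos):
    """Model of JAVASCRIPT: '<script' (non-greedy .)* '</script>'."""
    if not text.startswith("<script", pos):
        raise ValueError(f"mismatched input at {pos}")
    end = text.find("</script>", pos + len("<script"))
    if end == -1:
        raise ValueError("unterminated <script")
    return end + len("</script>")


def next_token_committed(text, pos):
    """Without the predicate: '<s' alone is taken to predict JAVASCRIPT."""
    if text.startswith("<s", pos):
        return "JAVASCRIPT", match_javascript(text, pos)
    if text.startswith("<", pos):
        return "OPENTAG", pos + 1
    raise ValueError(f"no viable alternative at {pos}")


def next_token_predicated(text, pos):
    """With ('<script>')=> : JAVASCRIPT is tried only if the lookahead matches."""
    if text.startswith("<script>", pos):
        return "JAVASCRIPT", match_javascript(text, pos)
    if text.startswith("<", pos):
        return "OPENTAG", pos + 1
    raise ValueError(f"no viable alternative at {pos}")


if __name__ == "__main__":
    for lexer in (next_token_committed, next_token_predicated):
        try:
            print(lexer.__name__, lexer("<set>", 0))
        except ValueError as err:
            print(lexer.__name__, "error:", err)
```

In this model the committed version fails on '<set>' because it has already chosen JAVASCRIPT, while the predicated version falls through to the plain '<' alternative, which is the effect the suggested OPENTAG rewrite is after.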


