[antlr-interest] lexer problem (BUG?)

Fri Jul 27 09:35:24 PDT 2007

Ruth Karl wrote:
> Thomas Brandon wrote:
>> On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>>  
>>> Thanks, Thomas.
>>> I did try your workaround (with the predicate...), but ANTLR still
>>> chokes on an input like <s>....
>>> (Same problem: it no longer sees the '<s' and then trips over the
>>> unexpected closing tag...)
>>> I guess I will try to treat JavaScript in another way, then... (a
>>> pity... ;-)
>>>     
>> Strange. I am able to correctly parse "<set><script>test</script><s>"
>> after making the modifications I gave. Note the interpreter in
>> ANTLRWorks doesn't execute actions or predicates so it won't work
>> there.
>>   
>
> Yes, I kind of knew/guessed that ANTLRWorks might not be able
> to handle this correctly, but neither did the generated classes...
> But as I am just discovering, there is a fundamental problem with the
> generation from ANTLRWorks now... even after completely removing any
> 'script' rule from the lexer grammar, I still get a generated lexer
> rule that tries to match("</script>")...
> I guess my system needs some garbage collection there... ;-)
>
> So maybe after that, your suggestion will work - I'll let you know ;-)

YES: it does! thanks again!
Ruth

>
> Thanks,
>
> Ruth
>> The full grammar I used was:
>> grammar JSP;
>>
>> options {
>>        output=AST;
>>        backtrack=true;
>>        memoize=true;
>> }
>>
>> // Lexer rules
>>
>> TEXT            :
>> ((~('<'|'>'|'%'|'/'|'"'|'\''|'('|')'|'['|']'|'{'|'}'|'\n'|'\t'|'\r'))
>> | ESCQUOTE)+
>>        ;
>> WS      :       (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
>>        ;
>> JAVACOMMENT     :       '/*' ( options {greedy=false;} : . )* '*/'
>> {$channel=HIDDEN;}
>>        ;
>> HTMLCOMMENT     :       '<!--' ( options {greedy=false;} : . )* '-->'
>> {$channel=HIDDEN;}
>>        ;
>> SCRIPTCOMMENT   :       '<%--' ( options {greedy=false;} : . )* '--%>'
>> {$channel=HIDDEN;}
>>        ;
>>
>> DOCTYPE :       '<!DOCTYPE' ( options {greedy=false;} : . )* '>'
>>        ;
>> DIRECTIVE       :       '<%@' ( options {greedy=false;} : . )* '%>'
>>        ;
>> DECLARATION     :       '<%!' ( options {greedy=false;} : . )* '%>'
>>        ;
>>
>> SCRIPTLETSTART  :       '<%'
>>        ;
>> SCRIPTLETEND    :       '%>'
>>        ;
>> EMPTYHTMLEND    :       '/>'
>>        ;
>> ESCQUOTE        :       '\\' (options {greedy=false;} : ('"' | '\''))
>>        ;
>>
>> fragment
>> JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
>>        ;
>> OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>>        |       '<'
>>        ;
>> CLOSETAG        :       '>'
>>        ;
>> SLASH           :       '/'
>>        ;
>> PERCENT :       '%'
>>        ;
>> LPAR    :       '('
>>        ;
>> RPAR    :       ')'
>>        ;
>> LCURL   :       '{'
>>        ;
>> RCURL   :       '}'
>>        ;
>> LBRA    :       '['
>>        ;
>> RBRA    :       ']'
>>        ;
>>
>> // LEXER: imaginary tokens/nodes for AST
>>
>> SCRIPTLET       :
>>        ;
>> HTMLTAG :
>>        ;
>> QUOTED  :
>>        ;
>> BRACKETEX       :
>>        ;
>> JS      :
>>        ;
>>
>>
>>
>> // Parser rules
>>
>> jsp     :       (content)* EOF
>>                ;
>> content         :       scriptlet
>>        |       htmltag
>>        |       quoted
>>        |       text
>>        |       PERCENT
>>        |       bracketexpr
>>        |       DOCTYPE
>>        |       RPAR
>>        |       RCURL
>>        |       RBRA
>>        |       slashComment
>>        |       directive
>>        |       declaration
>>        |       javascript
>>                ;
>> scriptlet       :       SCRIPTLETSTART (content)*  SCRIPTLETEND
>> ->^(SCRIPTLET content*)
>>        ;
>> htmltag :       OPENTAG (SLASH)? (htmltagcontent |slashComment)*
>> (EMPTYHTMLEND |CLOSETAG) ->^(HTMLTAG htmltagcontent*)
>>        ;
>> htmltagcontent  :       TEXT (PERCENT | TEXT)*
>>        |       bracketexpr
>>        |       quoted
>>        |       scriptlet
>>        ;
>> javascript      :       JAVASCRIPT ->^(JS JAVASCRIPT)
>>        ;
>> bracketexpr     :       LPAR expr* (RPAR)? ->^(BRACKETEX LPAR expr*)
>>        |       LCURL expr* (RCURL)? ->^(BRACKETEX LCURL expr*)
>>        |       LBRA expr* (RBRA)? ->^(BRACKETEX LBRA expr*)
>>        ;
>> expr    :       text
>>        |       SLASH
>>        |       OPENTAG
>>        |       CLOSETAG
>>        |       PERCENT
>>        |       '\\'
>>        |       bracketexpr
>>        |       quoted
>>        ;
>> slashComment    :       SLASH SLASH (TEXT)
>>        ;
>> text    :       TEXT  -> TEXT
>>        ;
>> quoted  :       dquoted
>>        |       squoted
>>        ;
>> dquoted :       '"' ( options {greedy=false;} : (dquotecontent) )* '"'
>> ->^(QUOTED dquotecontent*)
>>        ;
>> dquotecontent   :       text
>>        |       scriptlet
>>        |       bracketexpr
>>        |       SLASH
>>        |       OPENTAG
>>        |       CLOSETAG
>>        |       PERCENT
>>        |       RPAR
>>        |       '\\'
>>        |       squoted
>>        ;
>> squoted :       '\'' ( options {greedy=false;} : (squotecontent)  )*
>> '\''  ->^(QUOTED squotecontent*)
>>        ;
>> squotecontent   :       text
>>        |       scriptlet
>>        |       bracketexpr
>>        |       SLASH
>>        |       OPENTAG
>>        |       CLOSETAG
>>        |       PERCENT
>>        |       RPAR
>>        |       '\\'
>>        |       dquoted
>>        ;
>> directive       :       DIRECTIVE
>>        ;
>> declaration     :       DECLARATION
>>        ;
>>
>> Tom.
>>  
>>> Ruth
>>>
>>> Thomas Brandon wrote:
>>>    
>>>> On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>>>>
>>>>      
>>>>> Hi Andrew,
>>>>>
>>>>> thanks a lot for finding a smaller example to illustrate the problem.
>>>>> (Did you do it for the Java target or for C#, as I did?)
>>>>>
>>>>> Now: what can I do?
>>>>> I could (...) try to find a workaround in my grammar, but if it IS
>>>>> a bug, then a similar thing might happen in other cases as well....
>>>>>
>>>>>
>>>>>         
>>>> It's not a bug, though it may be considered a limitation.
>>>> The problem is that ANTLR's prediction algorithm doesn't look past
>>>> token boundaries, so it makes its predictions based on only a single
>>>> token. As the only possible single-token matches for '<' followed by
>>>> anything are JAVASCRIPT and OPENTAG (talking about your original
>>>> grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
>>>> predicts that it must be JAVASCRIPT, then gives an error when that
>>>> won't match. Looking at the mTokens method ANTLR generates may help
>>>> you see what is going on. The problem is discussed in this thread:
>>>> http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
>>>> Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
>>>> backtracking won't help and a predicate in OPENTAG won't be hoisted. A
>>>> fix for your original grammar is to replace the previous rules with:
>>>> fragment
>>>> JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
>>>>        ;
>>>> OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>>>>        |       '<'
>>>>        ;
>>>>
>>>> Ter said he'd investigate the possibility of enhancing the prediction
>>>> algorithm to deal with such cases.
>>>>
>>>> Tom.
>>>>
>>>>      
>>>>> Thanks for any further suggestions,
>>>>>
>>>>> Ruth
>>>>>
>>>>>
>>>>> Andrew Lentvorski wrote:
>>>>>
>>>>>        
>>>>>> Ruth Karl wrote:
>>>>>>
>>>>>>          
>>>>>>> Thanks, but I looked at it several times (even before I ever
>>>>>>> wrote to this list) and still I cannot see why, when I start an
>>>>>>> input with '<sx', the lexer should lose itself in a rule wanting
>>>>>>> '<script' as an input (given the grammar I attached in my last
>>>>>>> posting).
>>>>>>> Any other suggestions?
>>>>>>> Any other suggestions?
>>>>>>>
>>>>>>>             
>>>>>> Looks like a bug to me:
>>>>>>
>>>>>> grammar jsp;
>>>>>>
>>>>>> JAVASCRIPT    :    '<script>' ( options {greedy=false;} : . )*
>>>>>> '</script>' {System.out.print("J");};
>>>>>> ANY    :    . {System.out.print("A");};
>>>>>>
>>>>>> jsp        :    (ANY | JAVASCRIPT)* EOF;
>>>>>>
>>>>>> with input:
>>>>>>
>>>>>> <script>foo</script>
>>>>>> <s>bar</s>
>>>>>>
>>>>>>
>>>>>> Produces a token stream of:
>>>>>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
>>>>>>
>>>>>> aka
>>>>>>
>>>>>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
>>>>>>
>>>>>> Something vacuums up the "<s>b"
>>>>>>
>>>>>> The output is:
>>>>>> line 2:2 mismatched character '>' expecting 'c'
>>>>>> JAAAAAAAA
>>>>>>
>>>>>> You might want to file it and see what the response is.
>>>>>>
>>>>>> -a
>>>>>>
>>>>>>
>>>>>>           
>>>>       
>>
>>   
>
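
For readers following the thread, here is a rough sketch of the behaviour Tom describes in the quoted explanation. It is a hand-written simulation in plain Java, not the code ANTLR generates: the class PredictionSketch, the method tokenize, the ERROR marker and the one-character error recovery are all made up for illustration, and the script opener is assumed to be the literal '<script>' used in the predicate. With usePredicate set to false the decision is made on the two characters '<s' alone, which mimics the early commitment to JAVASCRIPT; with usePredicate set to true the whole '<script>' prefix is checked first, which is roughly what the ('<script>')=> syntactic predicate achieves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the lexer decision discussed in the thread.
// This is NOT ANTLR-generated code; names and recovery are illustrative only.
class PredictionSketch {
    static final String OPEN = "<script>";
    static final String CLOSE = "</script>";

    // Returns the token type names produced for the input.
    // usePredicate == false: commit to JAVASCRIPT after seeing only "<s".
    // usePredicate == true:  commit only if the full "<script>" prefix is present.
    static List<String> tokenize(String in, boolean usePredicate) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            boolean predictJs = usePredicate
                    ? in.startsWith(OPEN, pos)
                    : in.startsWith("<s", pos);
            if (predictJs) {
                // Non-greedy body: stop at the first closing tag after the opener.
                int end = in.startsWith(OPEN, pos)
                        ? in.indexOf(CLOSE, pos + OPEN.length())
                        : -1;
                if (end < 0) {
                    out.add("ERROR"); // prediction was wrong or the tag never closes
                    pos++;            // simplified recovery: skip one character
                } else {
                    out.add("JAVASCRIPT");
                    pos = end + CLOSE.length();
                }
            } else if (in.charAt(pos) == '<') {
                out.add("OPENTAG");
                pos++;
            } else {
                out.add("ANY");
                pos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<s>", false)); // [ERROR, ANY, ANY]
        System.out.println(tokenize("<s>", true));  // [OPENTAG, ANY, ANY]
        System.out.println(tokenize("<script>x</script><s>", true));
        // [JAVASCRIPT, OPENTAG, ANY, ANY]
    }
}
```

The real lexer recovers from the mismatch differently (Andrew's run lost more input than a single character), so only the difference between the two prediction strategies should be read from this sketch, not the exact token stream.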