[antlr-interest] lexer problem (BUG?)

Fri Jul 27 08:52:58 PDT 2007

Thomas Brandon wrote:
> On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>   
>> Thanks, Thomas.
>> I did try your workaround (with the predicate...), but ANTLR still chokes on
>> an input like <s>...
>> (Same problem: it does not see the '<s' anymore and then falls over the
>> sudden closing tag.)
>> I guess I will try to treat JavaScript in another way, then... (a
>> pity... ;-)
>>     
> Strange. I am able to correctly parse "<set><script>test</script><s>"
> after making the modifications I gave. Note the interpreter in
> ANTLRWorks doesn't execute actions or predicates so it won't work
> there.
>   

Yes, I kind of knew (or guessed) that ANTLRWorks might not be able to
handle this correctly, but neither did the generated classes...
But as I am just discovering, there is a more fundamental problem with the
code generation from ANTLRWorks right now: even after completely removing any
'script' rule from the lexer grammar, I still get a generated lexer rule
that tries to match("</script>")...
I guess my system needs some garbage collection there... ;-)

So maybe after that, your suggestion will work. I'll let you know ;-)

Thanks,

Ruth
> The full grammar I used was:
> grammar JSP;
>
> options {
>        output=AST;
>        backtrack=true;
>        memoize=true;
> }
>
> // Lexer rules
>
> TEXT            :
> ((~('<'|'>'|'%'|'/'|'"'|'\''|'('|')'|'['|']'|'{'|'}'|'\n'|'\t'|'\r'))
> | ESCQUOTE)+
>        ;
> WS      :       (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
>        ;
> JAVACOMMENT     :       '/*' ( options {greedy=false;} : . )* '*/'
> {$channel=HIDDEN;}
>        ;
> HTMLCOMMENT     :       '<!--' ( options {greedy=false;} : . )* '-->'
> {$channel=HIDDEN;}
>        ;
> SCRIPTCOMMENT   :       '<%--' ( options {greedy=false;} : . )* '--%>'
> {$channel=HIDDEN;}
>        ;
>
> DOCTYPE :       '<!DOCTYPE' ( options {greedy=false;} : . )* '>'
>        ;
> DIRECTIVE       :       '<%@' ( options {greedy=false;} : . )* '%>'
>        ;
> DECLARATION     :       '<%!' ( options {greedy=false;} : . )* '%>'
>        ;
>
> SCRIPTLETSTART  :       '<%'
>        ;
> SCRIPTLETEND    :       '%>'
>        ;
> EMPTYHTMLEND    :       '/>'
>        ;
> ESCQUOTE        :       '\\' (options {greedy=false;} : ('"' | '\''))
>        ;
>
> fragment
> JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
>        ;
> OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>        |       '<'
>        ;
> CLOSETAG        :       '>'
>                ;
> SLASH           :       '/'
>        ;
> PERCENT :       '%'
>        ;
> LPAR    :       '('
>        ;
> RPAR    :       ')'
>        ;
> LCURL   :       '{'
>        ;
> RCURL   :       '}'
>        ;
> LBRA    :       '['
>        ;
> RBRA    :       ']'
>        ;
>
> // LEXER: imaginary tokens/nodes for AST
>
> SCRIPTLET       :
>        ;
> HTMLTAG :
>        ;
> QUOTED  :
>        ;
> BRACKETEX       :
>        ;
> JS      :
>        ;
>
>
>
> // Parser rules
>
> jsp     :       (content)* EOF
>                ;
> content         :       scriptlet
>        |       htmltag
>        |       quoted
>        |       text
>        |       PERCENT
>        |       bracketexpr
>        |       DOCTYPE
>        |       RPAR
>        |       RCURL
>        |       RBRA
>        |       slashComment
>        |       directive
>        |       declaration
>        |       javascript
>                ;
> scriptlet       :       SCRIPTLETSTART (content)*  SCRIPTLETEND
> ->^(SCRIPTLET content*)
>        ;
> htmltag :       OPENTAG (SLASH)? (htmltagcontent |slashComment)*
> (EMPTYHTMLEND |CLOSETAG) ->^(HTMLTAG htmltagcontent*)
>        ;
> htmltagcontent  :       TEXT (PERCENT | TEXT)*
>        |       bracketexpr
>        |       quoted
>        |       scriptlet
>        ;
> javascript      :       JAVASCRIPT ->^(JS JAVASCRIPT)
>        ;
> bracketexpr     :       LPAR expr* (RPAR)? ->^(BRACKETEX LPAR expr*)
>        |       LCURL expr* (RCURL)? ->^(BRACKETEX LCURL expr*)
>        |       LBRA expr* (RBRA)? ->^(BRACKETEX LBRA expr*)
>        ;
> expr    :       text
>        |       SLASH
>        |       OPENTAG
>        |       CLOSETAG
>        |       PERCENT
>        |       '\\'
>        |       bracketexpr
>        |       quoted
>        ;
> slashComment    :       SLASH SLASH (TEXT)
>        ;
> text    :       TEXT  -> TEXT
>        ;
> quoted  :       dquoted
>        |       squoted
>        ;
> dquoted :       '"' ( options {greedy=false;} : (dquotecontent) )* '"'
> ->^(QUOTED dquotecontent*)
>        ;
> dquotecontent   :       text
>        |       scriptlet
>        |       bracketexpr
>        |       SLASH
>        |       OPENTAG
>        |       CLOSETAG
>        |       PERCENT
>        |       RPAR
>        |       '\\'
>        |       squoted
>        ;
> squoted :       '\'' ( options {greedy=false;} : (squotecontent)  )*
> '\''  ->^(QUOTED squotecontent*)
>        ;
> squotecontent   :       text
>        |       scriptlet
>        |       bracketexpr
>        |       SLASH
>        |       OPENTAG
>        |       CLOSETAG
>        |       PERCENT
>        |       RPAR
>        |       '\\'
>        |       dquoted
>        ;
> directive       :       DIRECTIVE
>        ;
> declaration     :       DECLARATION
>        ;
>
> Tom.
>   
>> Ruth
>>
>> Thomas Brandon wrote:
>>     
>>> On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>>>
>>>       
>>>> Hi Andrew,
>>>>
>>>> thanks a lot for finding a smaller example to illustrate the problem.
>>>> (Did you do it for the Java target or for C#, as I did?)
>>>>
>>>> Now: what can I do?
>>>> I could (...) try to find a workaround in my grammar, but if it IS a bug,
>>>> then a similar thing might happen in other cases as well...
>>>>
>>>>
>>>>         
>>> It's not a bug, though it may be considered a limitation.
>>> The problem is that ANTLR's prediction algorithm doesn't look past
>>> token boundaries, so it makes its predictions based on only a single
>>> token. As the only possible single-token matches for '<' followed by
>>> anything are JAVASCRIPT and OPENTAG (talking about your original
>>> grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
>>> predicts that it must be JAVASCRIPT, then gives an error when that
>>> won't match. Looking at the mTokens method ANTLR generates may help
>>> you see what is going on. The problem is discussed in
>>> http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
>>> .
>>> Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
>>> backtracking won't help and a predicate in OPENTAG won't be hoisted. A
>>> fix for your original grammar is to replace the previous rules with:
>>> fragment
>>> JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
>>>        ;
>>> OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>>>                               |               '<'
>>>        ;
>>>
>>> Ter said he'd investigate the possibility of enhancing the prediction
>>> algorithm to deal with such cases.
>>>
>>> Tom.
>>>
>>>       
>>>> Thanks for any further suggestions,
>>>>
>>>> Ruth
>>>>
>>>>
>>>> Andrew Lentvorski wrote:
>>>>
>>>>         
>>>>> Ruth Karl wrote:
>>>>>
>>>>>           
>>>>>> Thanks, but I looked at it several times (even before I ever wrote to
>>>>>> this list) and I still cannot see why, when I start an input
>>>>>> with '<sx', the lexer should lose itself in a rule wanting '<script'
>>>>>> as an input (given the grammar I attached in my last posting).
>>>>>> Any other suggestions?
>>>>>>
>>>>>>             
>>>>> Looks like a bug to me:
>>>>>
>>>>> grammar jsp;
>>>>>
>>>>> JAVASCRIPT    :    '<script>' ( options {greedy=false;} : . )*
>>>>> '</script>' {System.out.print("J");};
>>>>> ANY    :    . {System.out.print("A");};
>>>>>
>>>>> jsp        :    (ANY | JAVASCRIPT)* EOF;
>>>>>
>>>>> with input:
>>>>>
>>>>> <script>foo</script>
>>>>> <s>bar</s>
>>>>>
>>>>>
>>>>> Produces a token stream of:
>>>>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
>>>>>
>>>>> aka
>>>>>
>>>>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
>>>>>
>>>>> Something vacuums up the "<s>b"
>>>>>
>>>>> The output is:
>>>>> line 2:2 mismatched character '>' expecting 'c'
>>>>> JAAAAAAAA
>>>>>
>>>>> You might want to file it and see what the response is.
>>>>>
>>>>> -a
>>>>>
>>>>>
>>>>>           
>>>       
>
>
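
For readers following the thread, the behaviour Thomas describes (committing
to JAVASCRIPT as soon as '<s' is seen, versus checking the whole '<script>'
prefix first) can be illustrated with a small hand-written Java sketch. This
is not code that ANTLR generates: the class PredictionSketch, its method
names, and the "ERROR" pseudo-token are all hypothetical, and the way the
first method drops "<s" after a failed match is only a simplification of
real lexer error recovery.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not ANTLR-generated code.
final class PredictionSketch {
    static final String OPEN = "<script>";
    static final String CLOSE = "</script>";

    // Token name for a single character outside a script block.
    static String single(char c) {
        return c == '<' ? "OPENTAG" : "ANY";
    }

    // Decides on JAVASCRIPT after seeing only "<s", then fails if the
    // rest of "<script>...</script>" is not actually there.
    static List<String> lexCommitEarly(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith("<s", i)) {
                int end = in.startsWith(OPEN, i)
                        ? in.indexOf(CLOSE, i + OPEN.length())
                        : -1;
                if (end >= 0) {
                    out.add("JAVASCRIPT");
                    i = end + CLOSE.length();
                } else {
                    out.add("ERROR"); // committed too early
                    i += 2;           // simplification: drop the "<s"
                }
            } else {
                out.add(single(in.charAt(i)));
                i++;
            }
        }
        return out;
    }

    // Mimics the ('<script>')=> predicate: only choose JAVASCRIPT when
    // the full "<script>" prefix is present, otherwise emit OPENTAG.
    static List<String> lexWithPredicate(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith(OPEN, i)) {
                int end = in.indexOf(CLOSE, i + OPEN.length());
                if (end < 0) {
                    out.add("ERROR"); // unterminated script block
                    break;
                }
                out.add("JAVASCRIPT");
                i = end + CLOSE.length();
            } else {
                out.add(single(in.charAt(i)));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<script>x</script><s>";
        System.out.println(lexCommitEarly(input));
        System.out.println(lexWithPredicate(input));
    }
}
```

With the input from main, the first method reports an error at the '<s>'
tag while the second emits OPENTAG for its '<', which is the difference the
syntactic predicate in OPENTAG is meant to make.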