[antlr-interest] lexer problem (BUG?)
Ruth Karl
ruth.karl at gmx.de
Fri Jul 27 09:35:24 PDT 2007
Ruth Karl wrote:
> Thomas Brandon wrote:
>> On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>>
>>> Thanks, Thomas.
>>> I did try your workaround (with the predicate), but ANTLR still chokes
>>> on an input like <s>....
>>> (Same problem: it no longer sees the '<s' and then trips over the
>>> unexpected closing tag.)
>>> I guess I will try to treat JavaScript in another way, then... (a
>>> pity... ;-)
>>>
>> Strange. I am able to correctly parse "<set><script>test</script><s>"
>> after making the modifications I gave. Note the interpreter in
>> ANTLRWorks doesn't execute actions or predicates so it won't work
>> there.
>>
>
> Yes, I more or less knew (or guessed) that ANTLRWorks might not be able
> to handle this correctly, but neither did the generated classes...
> But as I am just discovering, there is now a more fundamental problem
> with the code generated from ANTLRWorks: even after completely removing
> every 'script' rule from the lexer grammar, I still get a generated
> lexer rule which tries to match("</script>")...
> I guess my system needs some garbage collection there... ;-)
>
> So maybe after that, your suggestion will work - I'll let you know ;-)
YES: it does! Thanks again!
Ruth
>
> Thanks,
>
> Ruth
>> The full grammar I used was:
>> grammar JSP;
>>
>> options {
>> output=AST;
>> backtrack=true;
>> memoize=true;
>> }
>>
>> // Lexer rules
>>
>> TEXT :
>> ((~('<'|'>'|'%'|'/'|'"'|'\''|'('|')'|'['|']'|'{'|'}'|'\n'|'\t'|'\r'))
>> | ESCQUOTE)+
>> ;
>> WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
>> ;
>> JAVACOMMENT : '/*' ( options {greedy=false;} : . )* '*/'
>> {$channel=HIDDEN;}
>> ;
>> HTMLCOMMENT : '<!--' ( options {greedy=false;} : . )* '-->'
>> {$channel=HIDDEN;}
>> ;
>> SCRIPTCOMMENT : '<%--' ( options {greedy=false;} : . )* '--%>'
>> {$channel=HIDDEN;}
>> ;
>>
>> DOCTYPE : '<!DOCTYPE' ( options {greedy=false;} : . )* '>'
>> ;
>> DIRECTIVE : '<%@' ( options {greedy=false;} : . )* '%>'
>> ;
>> DECLARATION : '<%!' ( options {greedy=false;} : . )* '%>'
>> ;
>>
>> SCRIPTLETSTART : '<%'
>> ;
>> SCRIPTLETEND : '%>'
>> ;
>> EMPTYHTMLEND : '/>'
>> ;
>> ESCQUOTE : '\\' (options {greedy=false;} : ('"' | '\''))
>> ;
>>
>> fragment
>> JAVASCRIPT : '<script' ( options {greedy=false;} : . )*
>> '</script>'
>> ;
>> OPENTAG : ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>> | '<'
>> ;
>> CLOSETAG : '>'
>> ;
>> SLASH : '/'
>> ;
>> PERCENT : '%'
>> ;
>> LPAR : '('
>> ;
>> RPAR : ')'
>> ;
>> LCURL : '{'
>> ;
>> RCURL : '}'
>> ;
>> LBRA : '['
>> ;
>> RBRA : ']'
>> ;
>>
>> // LEXER: imaginary tokens/nodes for AST
>>
>> SCRIPTLET :
>> ;
>> HTMLTAG :
>> ;
>> QUOTED :
>> ;
>> BRACKETEX :
>> ;
>> JS :
>> ;
>>
>>
>>
>> // Parser rules
>>
>> jsp : (content)* EOF
>> ;
>> content : scriptlet
>> | htmltag
>> | quoted
>> | text
>> | PERCENT
>> | bracketexpr
>> | DOCTYPE
>> | RPAR
>> | RCURL
>> | RBRA
>> | slashComment
>> | directive
>> | declaration
>> | javascript
>> ;
>> scriptlet : SCRIPTLETSTART (content)* SCRIPTLETEND
>> ->^(SCRIPTLET content*)
>> ;
>> htmltag : OPENTAG (SLASH)? (htmltagcontent |slashComment)*
>> (EMPTYHTMLEND |CLOSETAG) ->^(HTMLTAG htmltagcontent*)
>> ;
>> htmltagcontent : TEXT (PERCENT | TEXT)*
>> | bracketexpr
>> | quoted
>> | scriptlet
>> ;
>> javascript : JAVASCRIPT ->^(JS JAVASCRIPT)
>> ;
>> bracketexpr : LPAR expr* (RPAR)? ->^(BRACKETEX LPAR expr*)
>> | LCURL expr* (RCURL)? ->^(BRACKETEX LCURL expr*)
>> | LBRA expr* (RBRA)? ->^(BRACKETEX LBRA expr*)
>> ;
>> expr : text
>> | SLASH
>> | OPENTAG
>> | CLOSETAG
>> | PERCENT
>> | '\\'
>> | bracketexpr
>> | quoted
>> ;
>> slashComment : SLASH SLASH (TEXT)
>> ;
>> text : TEXT -> TEXT
>> ;
>> quoted : dquoted
>> | squoted
>> ;
>> dquoted : '"' ( options {greedy=false;} : (dquotecontent) )* '"'
>> ->^(QUOTED dquotecontent*)
>> ;
>> dquotecontent : text
>> | scriptlet
>> | bracketexpr
>> | SLASH
>> | OPENTAG
>> | CLOSETAG
>> | PERCENT
>> | RPAR
>> | '\\'
>> | squoted
>> ;
>> squoted : '\'' ( options {greedy=false;} : (squotecontent) )*
>> '\'' ->^(QUOTED squotecontent*)
>> ;
>> squotecontent : text
>> | scriptlet
>> | bracketexpr
>> | SLASH
>> | OPENTAG
>> | CLOSETAG
>> | PERCENT
>> | RPAR
>> | '\\'
>> | dquoted
>> ;
>> directive : DIRECTIVE
>> ;
>> declaration : DECLARATION
>> ;
>>
>> Tom.
>>
>>> Ruth
>>>
>>> Thomas Brandon wrote:
>>>
>>>> On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>>>>
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> thanks a lot for finding a smaller example to illustrate the problem.
>>>>> (Did you do it for the Java target or for C#, as I did?)
>>>>>
>>>>> Now: what can I do?
>>>>> I could (...) try to find a workaround in my grammar, but if it IS a
>>>>> bug, then a similar thing might happen in other cases as well....
>>>>>
>>>>>
>>>>>
>>>> It's not a bug, though it may be considered a limitation.
>>>> The problem is that ANTLR's prediction algorithm doesn't look past
>>>> token boundaries, so it makes its predictions based on only a single
>>>> token. As the only possible single-token matches for '<' followed by
>>>> anything are JAVASCRIPT and OPENTAG (talking about your original
>>>> grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
>>>> predicts that it must be JAVASCRIPT, then gives an error when that
>>>> won't match. Looking at the mTokens method ANTLR generates may help
>>>> you see what is going on. The problem is discussed in
>>>> http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
>>>>
>>>> Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
>>>> backtracking won't help and a predicate in OPENTAG won't be hoisted. A
>>>> fix for your original grammar is to replace the previous rules with:
>>>> fragment
>>>> JAVASCRIPT : '<script' ( options {greedy=false;} : . )*
>>>> '</script>'
>>>> ;
>>>> OPENTAG : ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>>>> | '<'
>>>> ;
>>>>
>>>> Ter said he'd investigate the possibility of enhancing the prediction
>>>> algorithm to deal with such cases.
>>>>
>>>> Tom.
>>>>
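To make Tom's explanation above concrete, here is a minimal sketch in plain Java. It is not ANTLR-generated code and does not reproduce the real mTokens method. The class name PredictionSketch, the token names CHAR and ERROR, and the "drop one character" error recovery are all assumptions made for illustration. It only contrasts committing to JAVASCRIPT on the '<s' prefix with first checking the whole '<script>' literal, which is roughly what the predicate in OPENTAG achieves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy lexer: illustrates the prediction problem only.
final class PredictionSketch {

    // Returns the index just past "</script>", or -1 if the
    // JAVASCRIPT shape ('<script' ... '</script>') does not match at start.
    static int matchJavascript(String input, int start) {
        if (!input.startsWith("<script", start)) {
            return -1;
        }
        int close = input.indexOf("</script>", start + "<script".length());
        return close < 0 ? -1 : close + "</script>".length();
    }

    // usePredicate == false: commit to JAVASCRIPT as soon as "<s" is seen.
    // usePredicate == true : commit only if the full "<script>" literal is there.
    static List<String> tokenize(String input, boolean usePredicate) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean predictScript = usePredicate
                    ? input.startsWith("<script>", pos)
                    : input.startsWith("<s", pos);
            if (predictScript) {
                int end = matchJavascript(input, pos);
                if (end < 0) {
                    // Wrong prediction: no fallback to OPENTAG.
                    // Assumed recovery: report an error and drop one character.
                    tokens.add("ERROR");
                    pos++;
                } else {
                    tokens.add("JAVASCRIPT");
                    pos = end;
                }
            } else if (input.charAt(pos) == '<') {
                tokens.add("OPENTAG");
                pos++;
            } else {
                tokens.add("CHAR"); // stand-in for every other token type
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "<script>x</script><s>";
        // prints [JAVASCRIPT, ERROR, CHAR, CHAR]
        System.out.println(tokenize(input, false));
        // prints [JAVASCRIPT, OPENTAG, CHAR, CHAR]
        System.out.println(tokenize(input, true));
    }
}
```

In the first call the '<' of "<s>" is lost to the failed JAVASCRIPT attempt. In the second it comes out as OPENTAG, because the decision is made only after the whole literal has been checked.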
>>>>
>>>>> Thanks for any further suggestions,
>>>>>
>>>>> Ruth
>>>>>
>>>>>
>>>>> Andrew Lentvorski wrote:
>>>>>
>>>>>
>>>>>> Ruth Karl wrote:
>>>>>>
>>>>>>
>>>>>>> Thanks, but I looked at it several times (even before I ever wrote
>>>>>>> to this list) and still I cannot see why, when I start an input
>>>>>>> with '<sx', the lexer should lose itself in a rule wanting '<script'
>>>>>>> as an input (given the grammar I attached in my last posting).
>>>>>>> Any other suggestions?
>>>>>>>
>>>>>>>
>>>>>> Looks like a bug to me:
>>>>>>
>>>>>> grammar jsp;
>>>>>>
>>>>>> JAVASCRIPT : '<script>' ( options {greedy=false;} : . )*
>>>>>> '</script>' {System.out.print("J");};
>>>>>> ANY : . {System.out.print("A");};
>>>>>>
>>>>>> jsp : (ANY | JAVASCRIPT)* EOF;
>>>>>>
>>>>>> with input:
>>>>>>
>>>>>> <script>foo</script>
>>>>>> <s>bar</s>
>>>>>>
>>>>>>
>>>>>> Produces a token stream of:
>>>>>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
>>>>>>
>>>>>> aka
>>>>>>
>>>>>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
>>>>>>
>>>>>> Something vacuums up the "<s>b"
>>>>>>
>>>>>> The output is:
>>>>>> line 2:2 mismatched character '>' expecting 'c'
>>>>>> JAAAAAAAA
>>>>>>
>>>>>> You might want to file it and see what the response is.
>>>>>>
>>>>>> -a
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
>>
>