[antlr-interest] lexer problem (BUG?)
Ruth Karl
ruth.karl at gmx.de
Fri Jul 27 08:52:58 PDT 2007
Thomas Brandon wrote:
> On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>
>> Thanks, Thomas.
>> I did try your workaround (with the predicate), but ANTLR still chokes on
>> input like <s>....
>> (Same problem: it no longer sees the '<s' and then trips over the
>> unexpected closing tag.)
>> I guess I will try to handle JavaScript some other way, then... (a
>> pity... ;-)
>>
> Strange. I am able to correctly parse "<set><script>test</script><s>"
> after making the modifications I gave. Note the interpreter in
> ANTLRWorks doesn't execute actions or predicates so it won't work
> there.
>
Yes, I kind of knew/guessed that ANTLRWorks might not be able to
handle this correctly, but neither did the generated classes...
But as I am just discovering, there is a more fundamental problem with the
generation from ANTLRWorks right now... even after completely removing any
'script' rule from the lexer grammar, I still get a lexer rule
generated that tries to match("</script>")...
I guess my system needs some garbage collection there... ;-)
So maybe after that your suggestion will work - I'll let you know ;-)
Thanks,
Ruth
> The full grammar I used was:
> grammar JSP;
>
> options {
> output=AST;
> backtrack=true;
> memoize=true;
> }
>
> // Lexer rules
>
> TEXT :
> ((~('<'|'>'|'%'|'/'|'"'|'\''|'('|')'|'['|']'|'{'|'}'|'\n'|'\t'|'\r'))
> | ESCQUOTE)+
> ;
> WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
> ;
> JAVACOMMENT : '/*' ( options {greedy=false;} : . )* '*/'
> {$channel=HIDDEN;}
> ;
> HTMLCOMMENT : '<!--' ( options {greedy=false;} : . )* '-->'
> {$channel=HIDDEN;}
> ;
> SCRIPTCOMMENT : '<%--' ( options {greedy=false;} : . )* '--%>'
> {$channel=HIDDEN;}
> ;
>
> DOCTYPE : '<!DOCTYPE' ( options {greedy=false;} : . )* '>'
> ;
> DIRECTIVE : '<%@' ( options {greedy=false;} : . )* '%>'
> ;
> DECLARATION : '<%!' ( options {greedy=false;} : . )* '%>'
> ;
>
> SCRIPTLETSTART : '<%'
> ;
> SCRIPTLETEND : '%>'
> ;
> EMPTYHTMLEND : '/>'
> ;
> ESCQUOTE : '\\' (options {greedy=false;} : ('"' | '\''))
> ;
>
> fragment
> JAVASCRIPT : '<script' ( options {greedy=false;} : . )* '</script>'
> ;
> OPENTAG : ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
> | '<'
> ;
> CLOSETAG : '>'
> ;
> SLASH : '/'
> ;
> PERCENT : '%'
> ;
> LPAR : '('
> ;
> RPAR : ')'
> ;
> LCURL : '{'
> ;
> RCURL : '}'
> ;
> LBRA : '['
> ;
> RBRA : ']'
> ;
>
> // LEXER: imaginary tokens/nodes for AST
>
> SCRIPTLET :
> ;
> HTMLTAG :
> ;
> QUOTED :
> ;
> BRACKETEX :
> ;
> JS :
> ;
>
>
>
> // Parser rules
>
> jsp : (content)* EOF
> ;
> content : scriptlet
> | htmltag
> | quoted
> | text
> | PERCENT
> | bracketexpr
> | DOCTYPE
> | RPAR
> | RCURL
> | RBRA
> | slashComment
> | directive
> | declaration
> | javascript
> ;
> scriptlet : SCRIPTLETSTART (content)* SCRIPTLETEND
> ->^(SCRIPTLET content*)
> ;
> htmltag : OPENTAG (SLASH)? (htmltagcontent |slashComment)*
> (EMPTYHTMLEND |CLOSETAG) ->^(HTMLTAG htmltagcontent*)
> ;
> htmltagcontent : TEXT (PERCENT | TEXT)*
> | bracketexpr
> | quoted
> | scriptlet
> ;
> javascript : JAVASCRIPT ->^(JS JAVASCRIPT)
> ;
> bracketexpr : LPAR expr* (RPAR)? ->^(BRACKETEX LPAR expr*)
> | LCURL expr* (RCURL)? ->^(BRACKETEX LCURL expr*)
> | LBRA expr* (RBRA)? ->^(BRACKETEX LBRA expr*)
> ;
> expr : text
> | SLASH
> | OPENTAG
> | CLOSETAG
> | PERCENT
> | '\\'
> | bracketexpr
> | quoted
> ;
> slashComment : SLASH SLASH (TEXT)
> ;
> text : TEXT -> TEXT
> ;
> quoted : dquoted
> | squoted
> ;
> dquoted : '"' ( options {greedy=false;} : (dquotecontent) )* '"'
> ->^(QUOTED dquotecontent*)
> ;
> dquotecontent : text
> | scriptlet
> | bracketexpr
> | SLASH
> | OPENTAG
> | CLOSETAG
> | PERCENT
> | RPAR
> | '\\'
> | squoted
> ;
> squoted : '\'' ( options {greedy=false;} : (squotecontent) )*
> '\'' ->^(QUOTED squotecontent*)
> ;
> squotecontent : text
> | scriptlet
> | bracketexpr
> | SLASH
> | OPENTAG
> | CLOSETAG
> | PERCENT
> | RPAR
> | '\\'
> | dquoted
> ;
> directive : DIRECTIVE
> ;
> declaration : DECLARATION
> ;
>
> Tom.
>
>> Ruth
>>
>> Thomas Brandon wrote:
>>
>>> On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
>>>
>>>
>>>> Hi Andrew,
>>>>
>>>> thanks a lot for finding a smaller example to illustrate the problem.
>>>> (Did you do it for java target or for c# - as I did?)
>>>>
>>>> Now: what can I do?
>>>> I could (...) try to find a workaround in my grammar, but if it IS a bug
>>>> - then a similar thing might happen in other cases as well....
>>>>
>>>>
>>>>
>>> It's not a bug, though it may be considered a limitation.
>>> The problem is that ANTLR's prediction algorithm doesn't look past
>>> token boundaries, so it makes its predictions based on only a single
>>> token. As the only possible single-token matches for '<' followed by
>>> anything are JAVASCRIPT and OPENTAG (talking about your original
>>> grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
>>> predicts that it must be JAVASCRIPT, then gives an error when that
>>> won't match. Looking at the mTokens method ANTLR generates may help
>>> you see what is going on. The problem is discussed in
>>> http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
>>> .
>>> Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
>>> backtracking won't help and a predicate in OPENTAG won't be hoisted. A
>>> fix for your original grammar is to replace the previous rules with:
>>> fragment
>>> JAVASCRIPT : '<script' ( options {greedy=false;} : . )* '</script>'
>>> ;
>>> OPENTAG : ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
>>> | '<'
>>> ;
>>>
>>> Ter said he'd investigate the possibility of enhancing the prediction
>>> algorithm to deal with such cases.
>>>
>>> Tom.
>>>
>>>
>>>> Thanks for any further suggestions,
>>>>
>>>> Ruth
>>>>
>>>>
>>>> Andrew Lentvorski wrote:
>>>>
>>>>
>>>>> Ruth Karl wrote:
>>>>>
>>>>>
>>>>>> Thanks, but I looked at it several times (even before I ever wrote to
>>>>>> this list) and I still cannot see why, when I start an input
>>>>>> with '<sx', the lexer should lose itself in a rule wanting '<script'
>>>>>> as input (given the grammar I attached in my last posting).
>>>>>> Any other suggestions?
>>>>>>
>>>>>>
>>>>> Looks like a bug to me:
>>>>>
>>>>> grammar jsp;
>>>>>
>>>>> JAVASCRIPT : '<script>' ( options {greedy=false;} : . )*
>>>>> '</script>' {System.out.print("J");};
>>>>> ANY : . {System.out.print("A");};
>>>>>
>>>>> jsp : (ANY | JAVASCRIPT)* EOF;
>>>>>
>>>>> with input:
>>>>>
>>>>> <script>foo</script>
>>>>> <s>bar</s>
>>>>>
>>>>>
>>>>> Produces a token stream of:
>>>>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
>>>>>
>>>>> aka
>>>>>
>>>>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
>>>>>
>>>>> Something vacuums up the "<s>b"
>>>>>
>>>>> The output is:
>>>>> line 2:2 mismatched character '>' expecting 'c'
>>>>> JAAAAAAAA
>>>>>
>>>>> You might want to file it and see what the response is.
>>>>>
>>>>> -a
>>>>>
>>>>>
>>>>>
>>>
>
>
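
To make Tom's explanation of the prediction problem concrete, here is a toy
Java sketch of the two strategies discussed in the thread: committing to
JAVASCRIPT as soon as '<s' is seen, versus first checking the '<script>'
guard the way the syntactic predicate in OPENTAG does. This is not
ANTLR-generated code and is not from the original posts. The class name
PredictionSketch, the tokenize method, the ERROR pseudo-tokens and the
simplified recovery (skip the offending character) are all assumptions made
for illustration; only the token names come from the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy lexer; it only imitates the behaviour described in the thread.
final class PredictionSketch {
    private static final String JS_START = "<script";   // literal at the start of JAVASCRIPT
    private static final String JS_GUARD = "<script>";  // text checked by the predicate
    private static final String JS_END = "</script>";

    /** Tokenizes the input; usePredicate selects the guarded OPENTAG variant. */
    static List<String> tokenize(String in, boolean usePredicate) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '<') {
                boolean predictJs = usePredicate
                        ? in.startsWith(JS_GUARD, i)                       // check the whole guard
                        : i + 1 < in.length() && in.charAt(i + 1) == 's';  // commit on "<s" alone
                if (predictJs) {
                    i = matchJavascript(in, i, out);
                } else {
                    out.add("OPENTAG");
                    i++;
                }
            } else if (c == '>') {
                out.add("CLOSETAG");
                i++;
            } else if (c == '/') {
                out.add("SLASH");
                i++;
            } else {
                // Simplified TEXT: a run of characters up to the next '<', '>' or '/'.
                int start = i;
                while (i < in.length() && "<>/".indexOf(in.charAt(i)) < 0) {
                    i++;
                }
                out.add("TEXT(" + in.substring(start, i) + ")");
            }
        }
        return out;
    }

    /** Tries to match JAVASCRIPT at start; returns the index where lexing resumes. */
    private static int matchJavascript(String in, int start, List<String> out) {
        for (int k = 0; k < JS_START.length(); k++) {
            int pos = start + k;
            if (pos >= in.length() || in.charAt(pos) != JS_START.charAt(k)) {
                String got = pos < in.length() ? "'" + in.charAt(pos) + "'" : "EOF";
                out.add("ERROR(mismatched " + got + " expecting '" + JS_START.charAt(k) + "')");
                // Assumed recovery: drop what was consumed plus the offending character.
                return Math.min(pos + 1, in.length());
            }
        }
        int end = in.indexOf(JS_END, start + JS_START.length());
        if (end < 0) {
            out.add("ERROR(unterminated script)");
            return in.length();
        }
        out.add("JAVASCRIPT");
        return end + JS_END.length();
    }

    public static void main(String[] args) {
        String input = "<set><script>test</script><s>";
        System.out.println("commit on '<s': " + tokenize(input, false));
        System.out.println("with predicate: " + tokenize(input, true));
    }
}
```

In this sketch, tokenize("<s>", false) produces only an error of the form
"mismatched '>' expecting 'c'", much like the message Andrew reported, and
the '<s' is lost. tokenize("<s>", true) produces OPENTAG, TEXT(s), CLOSETAG,
because the guard fails and the lexer falls back to the plain '<' alternative.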