[antlr-interest] lexer problem (BUG?)

Thomas Brandon tbrandonau at gmail.com
Fri Jul 27 08:39:44 PDT 2007


On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> Thanks, Thomas.
> I did try your workaround (with the predicate...), but antlr still
> chokes on an input like <s>....
> (same problem: it does not see the '<s' anymore and then falls over the
> sudden closing tag...)
> I guess I will try to treat javascript in another way, then... (a
> pity... ;-)
Strange. I am able to correctly parse "<set><script>test</script><s>"
after making the modifications I gave. Note that the interpreter in
ANTLRWorks doesn't execute actions or predicates, so the fix won't
work there; it has to be tested with the generated lexer.
The full grammar I used was:
grammar JSP;

options {
       output=AST;
       backtrack=true;
       memoize=true;
}

// Lexer rules

TEXT            :
((~('<'|'>'|'%'|'/'|'"'|'\''|'('|')'|'['|']'|'{'|'}'|'\n'|'\t'|'\r'))
| ESCQUOTE)+
       ;
WS      :       (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
       ;
JAVACOMMENT     :       '/*' ( options {greedy=false;} : . )* '*/'
{$channel=HIDDEN;}
       ;
HTMLCOMMENT     :       '<!--' ( options {greedy=false;} : . )* '-->'
{$channel=HIDDEN;}
       ;
SCRIPTCOMMENT   :       '<%--' ( options {greedy=false;} : . )* '--%>'
{$channel=HIDDEN;}
       ;

DOCTYPE :       '<!DOCTYPE' ( options {greedy=false;} : . )* '>'
       ;
DIRECTIVE       :       '<%@' ( options {greedy=false;} : . )* '%>'
       ;
DECLARATION     :       '<%!' ( options {greedy=false;} : . )* '%>'
       ;

SCRIPTLETSTART  :       '<%'
       ;
SCRIPTLETEND    :       '%>'
       ;
EMPTYHTMLEND    :       '/>'
       ;
ESCQUOTE        :       '\\' (options {greedy=false;} : ('"' | '\''))
       ;

fragment
JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
       ;
OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
       |       '<'
       ;
CLOSETAG        :       '>'
               ;
SLASH           :       '/'
       ;
PERCENT :       '%'
       ;
LPAR    :       '('
       ;
RPAR    :       ')'
       ;
LCURL   :       '{'
       ;
RCURL   :       '}'
       ;
LBRA    :       '['
       ;
RBRA    :       ']'
       ;

// LEXER: imaginary tokens/nodes for AST

SCRIPTLET       :
       ;
HTMLTAG :
       ;
QUOTED  :
       ;
BRACKETEX       :
       ;
JS      :
       ;



// Parser rules

jsp     :       (content)* EOF
               ;
content         :       scriptlet
       |       htmltag
       |       quoted
       |       text
       |       PERCENT
       |       bracketexpr
       |       DOCTYPE
       |       RPAR
       |       RCURL
       |       RBRA
       |       slashComment
       |       directive
       |       declaration
       |       javascript
               ;
scriptlet       :       SCRIPTLETSTART (content)*  SCRIPTLETEND
->^(SCRIPTLET content*)
       ;
htmltag :       OPENTAG (SLASH)? (htmltagcontent |slashComment)*
(EMPTYHTMLEND |CLOSETAG) ->^(HTMLTAG htmltagcontent*)
       ;
htmltagcontent  :       TEXT (PERCENT | TEXT)*
       |       bracketexpr
       |       quoted
       |       scriptlet
       ;
javascript      :       JAVASCRIPT ->^(JS JAVASCRIPT)
       ;
bracketexpr     :       LPAR expr* (RPAR)? ->^(BRACKETEX LPAR expr*)
       |       LCURL expr* (RCURL)? ->^(BRACKETEX LCURL expr*)
       |       LBRA expr* (RBRA)? ->^(BRACKETEX LBRA expr*)
       ;
expr    :       text
       |       SLASH
       |       OPENTAG
       |       CLOSETAG
       |       PERCENT
       |       '\\'
       |       bracketexpr
       |       quoted
       ;
slashComment    :       SLASH SLASH (TEXT)
       ;
text    :       TEXT  -> TEXT
       ;
quoted  :       dquoted
       |       squoted
       ;
dquoted :       '"' ( options {greedy=false;} : (dquotecontent) )* '"'
->^(QUOTED dquotecontent*)
       ;
dquotecontent   :       text
       |       scriptlet
       |       bracketexpr
       |       SLASH
       |       OPENTAG
       |       CLOSETAG
       |       PERCENT
       |       RPAR
       |       '\\'
       |       squoted
       ;
squoted :       '\'' ( options {greedy=false;} : (squotecontent)  )*
'\''  ->^(QUOTED squotecontent*)
       ;
squotecontent   :       text
       |       scriptlet
       |       bracketexpr
       |       SLASH
       |       OPENTAG
       |       CLOSETAG
       |       PERCENT
       |       RPAR
       |       '\\'
       |       dquoted
       ;
directive       :       DIRECTIVE
       ;
declaration     :       DECLARATION
       ;
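
To see why the predicate in OPENTAG makes the difference, here is a
rough hand-written Java model of the two prediction strategies. It is
only a sketch under my own assumptions, not the code ANTLR generates:
the class name PredictionSketch, the tokenize method, the ERROR
pseudo-token and the "skip the two consumed characters" recovery are
all made up for illustration, and every character other than '<' is
lumped into a one-character ANY token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the lexer's decision at '<'; not ANTLR output.
class PredictionSketch {
    static final String OPEN = "<script";
    static final String CLOSE = "</script>";

    // usePredicate=false models committing to JAVASCRIPT on seeing "<s".
    // usePredicate=true models the ('<script>')=> syntactic predicate.
    static List<String> tokenize(String in, boolean usePredicate) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            if (in.charAt(i) != '<') {
                out.add("ANY");          // simplification: one token per char
                i++;
                continue;
            }
            boolean predictScript = usePredicate
                    ? in.startsWith("<script>", i)  // check the whole opening tag
                    : in.startsWith("<s", i);       // commit after two characters
            if (!predictScript) {
                out.add("OPENTAG");      // plain '<'
                i++;
                continue;
            }
            // Committed to JAVASCRIPT: the input must now really match it.
            if (!in.startsWith(OPEN, i)) {
                out.add("ERROR");        // e.g. "<s>" or "<set>"
                i += 2;                  // assumed recovery: drop "<s"
                continue;
            }
            // Non-greedy body: stop at the first closing tag.
            int end = in.indexOf(CLOSE, i + OPEN.length());
            if (end < 0) {
                out.add("ERROR");        // unterminated script block
                break;
            }
            out.add("JAVASCRIPT");
            i = end + CLOSE.length();
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<set><script>test</script><s>";
        System.out.println(tokenize(input, false));
        System.out.println(tokenize(input, true));
    }
}
```

With usePredicate set to false the model reports an error for both
"<set>" and "<s>", which is the symptom described in this thread; with
it set to true those fall through to OPENTAG and only the real script
block becomes JAVASCRIPT.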

Tom.
> Ruth
>
> Thomas Brandon schrieb:
> > On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> >
> >> Hi Andrew,
> >>
> >> thanks a lot for finding a smaller example to illustrate the problem.
> >> (Did you do it for java target or for c# - as I did?)
> >>
> >> Now: what can I do?
> >> I could (...) try to find a workaround in my grammar, but if it IS a bug
> >> - then a similar thing might happen in other cases as well....
> >>
> >>
> > It's not a bug, though it may be considered a limitation.
> > The problem is that ANTLR's prediction algorithm doesn't look past
> > token boundaries, so it makes its predictions based on only a single
> > token. As the only possible single-token matches for '<' followed by
> > anything are JAVASCRIPT and OPENTAG (talking about your original
> > grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
> > predicts that it must be JAVASCRIPT, then gives an error when that
> > won't match. Looking at the mTokens method ANTLR generates may help
> > you see what is going on. The problem is discussed in
> > http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
> > .
> > Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
> > backtracking won't help and a predicate in OPENTAG won't be hoisted. A
> > fix for your original grammar is to replace the previous rules with:
> > fragment
> > JAVASCRIPT      :       '<script' ( options {greedy=false;} : . )* '</script>'
> >        ;
> > OPENTAG         :       ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
> >                               |               '<'
> >        ;
> >
> > Ter said he'd investigate the possibility of enhancing the prediction
> > algorithm to deal with such cases.
> >
> > Tom.
> >
> >> Thanks for any further suggestions,
> >>
> >> Ruth
> >>
> >>
> >> Andrew Lentvorski schrieb:
> >>
> >>> Ruth Karl wrote:
> >>>
> >>>> Thanks, but I looked at it several times (even before I ever wrote to
> >>>> this list) and still I cannot see why, when I start an input
> >>>> with '<sx', the lexer should lose itself in a rule wanting '<script'
> >>>> as an input (given the grammar I attached in my last posting).
> >>>> Any other suggestions?
> >>>>
> >>> Looks like a bug to me:
> >>>
> >>> grammar jsp;
> >>>
> >>> JAVASCRIPT    :    '<script>' ( options {greedy=false;} : . )*
> >>> '</script>' {System.out.print("J");};
> >>> ANY    :    . {System.out.print("A");};
> >>>
> >>> jsp        :    (ANY | JAVASCRIPT)* EOF;
> >>>
> >>> with input:
> >>>
> >>> <script>foo</script>
> >>> <s>bar</s>
> >>>
> >>>
> >>> Produces a token stream of:
> >>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
> >>>
> >>> aka
> >>>
> >>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
> >>>
> >>> Something vacuums up the "<s>b"
> >>>
> >>> The output is:
> >>> line 2:2 mismatched character '>' expecting 'c'
> >>> JAAAAAAAA
> >>>
> >>> You might want to file it and see what the response is.
> >>>
> >>> -a
> >>>
> >>>
> >
> >
>

