[antlr-interest] lexer problem (BUG?)
Thomas Brandon
tbrandonau at gmail.com
Fri Jul 27 08:39:44 PDT 2007
On 7/28/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> Thanks, Thomas.
> I did try your workaround (with the predicate...), but ANTLR still chokes on
> an input like <s>....
> (Same problem: it does not see the '<s' anymore and then falls over the
> sudden closing tag...)
> I guess I will try to treat javascript in another way, then... (a
> pity... ;-)
Strange. I am able to correctly parse "<set><script>test</script><s>"
after making the modifications I gave. Note that the interpreter in
ANTLRWorks doesn't execute actions or predicates, so the fix won't work
there.
The full grammar I used was:
grammar JSP;
options {
output=AST;
backtrack=true;
memoize=true;
}
// Lexer rules
TEXT :
((~('<'|'>'|'%'|'/'|'"'|'\''|'('|')'|'['|']'|'{'|'}'|'\n'|'\t'|'\r'))
| ESCQUOTE)+
;
WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
;
JAVACOMMENT : '/*' ( options {greedy=false;} : . )* '*/'
{$channel=HIDDEN;}
;
HTMLCOMMENT : '<!--' ( options {greedy=false;} : . )* '-->'
{$channel=HIDDEN;}
;
SCRIPTCOMMENT : '<%--' ( options {greedy=false;} : . )* '--%>'
{$channel=HIDDEN;}
;
DOCTYPE : '<!DOCTYPE' ( options {greedy=false;} : . )* '>'
;
DIRECTIVE : '<%@' ( options {greedy=false;} : . )* '%>'
;
DECLARATION : '<%!' ( options {greedy=false;} : . )* '%>'
;
SCRIPTLETSTART : '<%'
;
SCRIPTLETEND : '%>'
;
EMPTYHTMLEND : '/>'
;
ESCQUOTE : '\\' (options {greedy=false;} : ('"' | '\''))
;
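// JAVASCRIPT is a fragment, so the lexer never predicts it on its own;
// it is only reached through OPENTAG below.
fragment
JAVASCRIPT : '<script' ( options {greedy=false;} : . )* '</script>'
;
// The syntactic predicate commits to JAVASCRIPT only when the input really
// starts with '<script>'; otherwise a lone '<' is emitted as OPENTAG.
OPENTAG : ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
| '<'
;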
CLOSETAG : '>'
;
SLASH : '/'
;
PERCENT : '%'
;
LPAR : '('
;
RPAR : ')'
;
LCURL : '{'
;
RCURL : '}'
;
LBRA : '['
;
RBRA : ']'
;
// LEXER: imaginary tokens/nodes for AST
SCRIPTLET :
;
HTMLTAG :
;
QUOTED :
;
BRACKETEX :
;
JS :
;
// Parser rules
jsp : (content)* EOF
;
content : scriptlet
| htmltag
| quoted
| text
| PERCENT
| bracketexpr
| DOCTYPE
| RPAR
| RCURL
| RBRA
| slashComment
| directive
| declaration
| javascript
;
scriptlet : SCRIPTLETSTART (content)* SCRIPTLETEND
->^(SCRIPTLET content*)
;
htmltag : OPENTAG (SLASH)? (htmltagcontent |slashComment)*
(EMPTYHTMLEND |CLOSETAG) ->^(HTMLTAG htmltagcontent*)
;
htmltagcontent : TEXT (PERCENT | TEXT)*
| bracketexpr
| quoted
| scriptlet
;
javascript : JAVASCRIPT ->^(JS JAVASCRIPT)
;
bracketexpr : LPAR expr* (RPAR)? ->^(BRACKETEX LPAR expr*)
| LCURL expr* (RCURL)? ->^(BRACKETEX LCURL expr*)
| LBRA expr* (RBRA)? ->^(BRACKETEX LBRA expr*)
;
expr : text
| SLASH
| OPENTAG
| CLOSETAG
| PERCENT
| '\\'
| bracketexpr
| quoted
;
slashComment : SLASH SLASH (TEXT)
;
text : TEXT -> TEXT
;
quoted : dquoted
| squoted
;
dquoted : '"' ( options {greedy=false;} : (dquotecontent) )* '"'
->^(QUOTED dquotecontent*)
;
dquotecontent : text
| scriptlet
| bracketexpr
| SLASH
| OPENTAG
| CLOSETAG
| PERCENT
| RPAR
| '\\'
| squoted
;
squoted : '\'' ( options {greedy=false;} : (squotecontent) )*
'\'' ->^(QUOTED squotecontent*)
;
squotecontent : text
| scriptlet
| bracketexpr
| SLASH
| OPENTAG
| CLOSETAG
| PERCENT
| RPAR
| '\\'
| dquoted
;
directive : DIRECTIVE
;
declaration : DECLARATION
;
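To make the effect of the predicate concrete, here is a rough sketch in
plain Java. It is not the code ANTLR generates, only a hand-written
imitation of the two behaviours: committing to JAVASCRIPT as soon as '<s'
is seen, versus checking for the whole '<script>' first. The class name,
the method names and the CHAR/ERROR token labels are all made up for the
illustration, and the "skip one character" error handling is a
simplification, not ANTLR's actual recovery.

```java
import java.util.ArrayList;
import java.util.List;

public class PredictionSketch {
    // Returns the index just past "</script>" if a script block starts at i,
    // or -1 if the input at i is not a complete script block.
    static int matchJavascript(String in, int i) {
        if (!in.startsWith("<script", i)) return -1;
        int close = in.indexOf("</script>", i + "<script".length());
        return close < 0 ? -1 : close + "</script>".length();
    }

    // guarded == false: commit to JAVASCRIPT on '<s' (the behaviour described
    //                   for the original grammar).
    // guarded == true:  commit only if "<script>" is really there, imitating
    //                   the ('<script>')=> predicate in OPENTAG.
    static List<String> lex(String in, boolean guarded) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            boolean commit = guarded
                    ? in.startsWith("<script>", i)
                    : in.startsWith("<s", i);
            if (commit) {
                int end = matchJavascript(in, i);
                if (end < 0) {
                    tokens.add("ERROR"); // simplified recovery: skip one char
                    i++;
                } else {
                    tokens.add("JAVASCRIPT");
                    i = end;
                }
            } else if (in.charAt(i) == '<') {
                tokens.add("OPENTAG");
                i++;
            } else {
                tokens.add("CHAR"); // stand-in for every other token type
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "<set><script>test</script><s>";
        System.out.println("unguarded: " + lex(input, false));
        System.out.println("guarded:   " + lex(input, true));
    }
}
```

On the test input above, the unguarded variant reports ERROR at both
'<set>' and '<s>', while the guarded variant yields OPENTAG in both places
and still recognises the script block as a single JAVASCRIPT token.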
Tom.
> Ruth
>
> Thomas Brandon wrote:
> > On 7/27/07, Ruth Karl <ruth.karl at gmx.de> wrote:
> >
> >> Hi Andrew,
> >>
> >> thanks a lot for finding a smaller example to illustrate the problem.
> >> (Did you do it for java target or for c# - as I did?)
> >>
> >> Now: what can I do?
> >> I could (...) try to find a workaround in my grammar, but if it IS a bug
> >> - then a similar thing might happen in other cases as well....
> >>
> >>
> > It's not a bug, though it may be considered a limitation.
> > The problem is that ANTLR's prediction algorithm doesn't look past
> > token boundaries, so it makes its predictions based on only a single
> > token. As the only possible single-token matches for '<' followed by
> > anything are JAVASCRIPT and OPENTAG (talking about your original
> > grammar here, not the shorter sample), as soon as ANTLR sees '<s' it
> > predicts that it must be JAVASCRIPT, then gives an error when that
> > won't match. Looking at the mTokens method ANTLR generates may help
> > you see what is going on. The problem is discussed in
> > http://www.antlr.org/pipermail/antlr-interest/2007-July/022349.html
> > Unfortunately, as ANTLR doesn't consider there to be any ambiguity,
> > backtracking won't help and a predicate in OPENTAG won't be hoisted. A
> > fix for your original grammar is to replace the previous rules with:
> > fragment
> > JAVASCRIPT : '<script' ( options {greedy=false;} : . )* '</script>'
> > ;
> > OPENTAG : ('<script>')=>JAVASCRIPT {$type=JAVASCRIPT;}
> > | '<'
> > ;
> >
> > Ter said he'd investigate the possibility of enhancing the prediction
> > algorithm to deal with such cases.
> >
> > Tom.
> >
> >> Thanks for any further suggestions,
> >>
> >> Ruth
> >>
> >>
> >> Andrew Lentvorski wrote:
> >>
> >>> Ruth Karl wrote:
> >>>
> >>>> Thanks, but I looked at it several times (even before I ever wrote to
> >>>> this list) and still I cannot see why, when I start an input
> >>>> with '<sx', the lexer should lose itself in a rule wanting '<script'
> >>>> as an input (given the grammar I attached in my last posting).
> >>>> Any other suggestions?
> >>>>
> >>> Looks like a bug to me:
> >>>
> >>> grammar jsp;
> >>>
> >>> JAVASCRIPT : '<script>' ( options {greedy=false;} : . )*
> >>> '</script>' {System.out.print("J");};
> >>> ANY : . {System.out.print("A");};
> >>>
> >>> jsp : (ANY | JAVASCRIPT)* EOF;
> >>>
> >>> with input:
> >>>
> >>> <script>foo</script>
> >>> <s>bar</s>
> >>>
> >>>
> >>> Produces a token stream of:
> >>> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
> >>>
> >>> aka
> >>>
> >>> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
> >>>
> >>> Something vacuums up the "<s>b"
> >>>
> >>> The output is:
> >>> line 2:2 mismatched character '>' expecting 'c'
> >>> JAAAAAAAA
> >>>
> >>> You might want to file it and see what the response is.
> >>>
> >>> -a
> >>>
> >>>
> >
> >
>