[antlr-interest] Prediction DFA bug?
Andreas Meyer
andreas.meyer at smartshift.de
Fri Mar 6 09:16:56 PST 2009
Hi,
some days ago there was a discussion on how to solve the identifier vs.
keyword problem. Some said it's not feasible to have 800 individual
keyword tokens, and one should rather partition the token set into
"supertokens" representing (lets say) 40 tokens, or even only have one
"keyword" token. Some weeks ago I tried already, with an extreme
approach, namely every keyword is an ID token + semantic predicates. The
produced parser had some strange properties, namely, it was misled into
wrong rules, and then complained about mismatched keywords, ie, while
looking for a statement beginning with keyword_1, it decided to go for
the statement rule beginning with keyword_2, which in turn raised a
PredicateFailedException:
keyword_2: {is_keyword2()}? IDENTIFIER;
I quickly put this attempt into my archive and tried again with the 800
keyword tokens approach, which also did not work (timeout because of
huge prediction DFAs). Now, after the discussion yesterday on this list,
I again looked at the "supertoken" approach, but now I did it more
"hybrid", namely with two tokens: ID and KEYWORD --- with limited
success. I have simplified the original grammar a lot, so that it only
contains few lines of code and is self-contained (can be debugged in
ANTLRworks). The error/issue/behaviour-producing input is also attached
(XXXTEST.abap).
Here is the problem, as I see it:
program
: statements EOF
;
statements
: ( statement PERIOD ) *
;
// here the parser see's a KEYWORD token, and wants to further
disambiguate by looking ahead
statement
// somehow, on input KEYWORD ID KEYWORD ID, the parser is
attracted into this alternative,
// because it only considers the lookahead KEYWORD ID KEYWORD ...
: kw_stmt2^ id kw_abc
// .. although it should take this path, on input KEYWORD ID
KEYWORD ID PERIOD:
| kw_stmt1^ rule
| kw_stmt3^ id
;
rule
: id xyz*
;
xyz
: kw_xyz^ id
;
id : ID
| KEYWORD
;
kw_stmt1: {getKeywordType( input.LT(1) ) == KW_STMT1}? KEYWORD -> KW_STMT1;
....
Now, one could advise to switch the semantic lookahead into a gated
semantic lookahead:
kw_stmt1: {getKeywordType( input.LT(1) ) == KW_STMT1}?=> KEYWORD ->
KW_STMT1;
This in fact solves the above problem, so what? When using the
gated-semantic-predicate approach in my original grammar, with 800 of
these keyword distinguishing rules (note: rules! not tokens), however,
this again leads to timeouts in exactly the same rule as with the
original 800 individual keyword tokens. It must be due to the size of
some automaton, and it seems to grow not because of the number of
tokens, but because of the number of rules+attached gated semantic
predicates.
I'm sorry that currently I dont have small or at least synthetic grammar
that also demonstrates my timeout issue, but I think that the above
behaviour is in itself interesting, or (so to say) not correct,
regardless of what problem I am trying to solve with it. My opinion is
that ANTLR should somehow use a larger lookahead in the situation above.
I would really like to dive into ANTLR source code and check for myself,
but are somewhat overwhelmed by its complexity ... it seems that LL(*)
is built on a simple idea, but to implement it with all corner cases,
takes a lot of effort.
Best Regards,
Andreas Meyer
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ABAP4.g
Url: http://www.antlr.org/pipermail/antlr-interest/attachments/20090306/69087271/attachment.pl
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: XXXTEST.abap
Url: http://www.antlr.org/pipermail/antlr-interest/attachments/20090306/69087271/attachment-0001.pl
More information about the antlr-interest
mailing list