[antlr-interest] Prediction DFA bug?

Fri Mar 6 09:16:56 PST 2009

Hi,

some days ago there was a discussion on how to solve the identifier vs. 
keyword problem. Some said it's not feasible to have 800 individual 
keyword tokens, and one should rather partition the token set into 
"supertokens" representing (lets say) 40 tokens, or even only have one 
"keyword" token. Some weeks ago I tried already, with an extreme 
approach, namely every keyword is an ID token + semantic predicates. The 
produced parser had some strange properties, namely, it was misled into 
wrong rules, and then complained about mismatched keywords, ie, while 
looking for a statement beginning with keyword_1, it decided to go for 
the statement rule beginning with keyword_2, which in turn raised a 
PredicateFailedException:

keyword_2: {is_keyword2()}? IDENTIFIER;

I quickly put this attempt into my archive and tried again with the 800 
keyword tokens approach, which also did not work (timeout because of 
huge prediction DFAs). Now, after the discussion yesterday on this list, 
I again looked at the "supertoken" approach, but now I did it more 
"hybrid", namely with two tokens: ID and KEYWORD --- with limited 
success. I have simplified the original grammar a lot, so that it only 
contains few lines of code and is self-contained (can be debugged in 
ANTLRworks). The error/issue/behaviour-producing input is also attached 
(XXXTEST.abap).

Here is the problem, as I see it:

program
    :  statements  EOF
    ;

statements
    :  ( statement PERIOD  ) * 
    ;

// here the parser see's a KEYWORD token, and wants to further 
disambiguate by looking ahead
statement
      // somehow, on input KEYWORD ID KEYWORD ID, the parser is 
attracted into this alternative,
      // because it only considers the lookahead KEYWORD ID KEYWORD ...
    : kw_stmt2^  id kw_abc
      // .. although it should take this path, on input KEYWORD ID 
KEYWORD ID PERIOD:
    | kw_stmt1^  rule
    | kw_stmt3^  id
    ;

rule
    : id  xyz*
    ;

xyz   
    : kw_xyz^ id
    ;

id  : ID
    | KEYWORD
    ;

kw_stmt1: {getKeywordType( input.LT(1) ) == KW_STMT1}? KEYWORD -> KW_STMT1;
....

Now, one could advise to switch the semantic lookahead into a gated 
semantic lookahead:

kw_stmt1: {getKeywordType( input.LT(1) ) == KW_STMT1}?=> KEYWORD -> 
KW_STMT1;

This in fact solves the above problem, so what? When using the 
gated-semantic-predicate approach in my original grammar, with 800 of 
these keyword distinguishing rules (note: rules! not tokens), however, 
this again leads to timeouts in exactly the same rule as with the 
original 800 individual keyword tokens. It must be due to the size of 
some automaton, and it seems to grow not because of the number of 
tokens, but because of the number of rules+attached gated semantic 
predicates.

I'm sorry that currently I dont have small or at least synthetic grammar 
that also demonstrates my timeout issue, but I think that the above 
behaviour is in itself interesting, or (so to say) not correct, 
regardless of what problem I am trying to solve with it. My opinion is 
that ANTLR should somehow use a larger lookahead in the situation above. 
I would really like to dive into ANTLR source code and check for myself, 
but are somewhat overwhelmed by its complexity ... it seems that LL(*) 
is built on a simple idea, but to implement it with all corner cases, 
takes a lot of effort.

Best Regards,
Andreas Meyer
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ABAP4.g
Url: http://www.antlr.org/pipermail/antlr-interest/attachments/20090306/69087271/attachment.pl 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: XXXTEST.abap
Url: http://www.antlr.org/pipermail/antlr-interest/attachments/20090306/69087271/attachment-0001.pl