[antlr-interest] Prediction DFA bug?

Loring Craymer lgcraymer at yahoo.com
Fri Mar 6 14:06:10 PST 2009


Andreas--

After doing some research on the web, I have to believe that ABAP is one of those horrible languages created by the clueless for the masochistic and easily victimized.  I think that most of the 800 keywords you describe are probably function names or the equivalent and that the lack of a formal description for the language leads to a considerable confusion of syntax with semantics.  It does not help that SQL is embedded as a first class part of the language.

My advice is to back way off and try to abstract ABAP syntax, identifying as few keywords as possible.  "DO" and "WHILE", for example, are clearly keywords, but "WRITE" is an identifier that names a function.  Defer as much semantic processing to later passes as possible.  Except for the embedded SQL, I suspect that ABAP is not much more complicated than Java:  it just has not been properly codified.
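
To make the idea concrete, here is a minimal sketch of what such a pared-down grammar shape might look like. All rule and token names below are invented for illustration; this is not real ABAP syntax:

```antlr
// Sketch: only structural words are lexed as keywords; everything
// else, including "WRITE", comes through as a plain identifier.
statement
  : do_block
  | call_stmt
  ;

do_block
  : DO statement* ENDDO PERIOD      // DO/ENDDO are genuine keywords
  ;

call_stmt
  : id argument* PERIOD             // "WRITE x." parses as a call of
  ;                                 // the function named by the id

argument
  : id
  ;

id
  : ID                              // WRITE, and most of the other
  ;                                 // "800 keywords", lex as ID here
```

Whether WRITE really names a function is then a question for a later semantic pass, not for the parser.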

--Loring

 


----- Original Message ----
> From: Andreas Meyer <andreas.meyer at smartshift.de>
> To: antlr-interest at antlr.org
> Sent: Friday, March 6, 2009 9:16:56 AM
> Subject: [antlr-interest] Prediction DFA bug?
> 
> Hi,
> 
> some days ago there was a discussion on how to solve the identifier-vs.-keyword 
> problem. Some said it is not feasible to have 800 individual keyword tokens, and 
> that one should rather partition the token set into "supertokens" representing 
> (let's say) 40 tokens each, or even have only a single "keyword" token. Some 
> weeks ago I had already tried an extreme approach, namely making every keyword 
> an ID token plus a semantic predicate. The resulting parser had some strange 
> properties: it was misled into wrong rules and then complained about mismatched 
> keywords, i.e., while looking for a statement beginning with keyword_1, it 
> decided to go for the statement rule beginning with keyword_2, which in turn 
> raised a FailedPredicateException:
> 
> keyword_2: {is_keyword2()}? IDENTIFIER;
> 
> I quickly put this attempt into my archive and tried again with the 
> 800-keyword-token approach, which also did not work (timeouts because of huge 
> prediction DFAs). Now, after the discussion yesterday on this list, I looked 
> again at the "supertoken" approach, but this time in a more "hybrid" form, 
> namely with two tokens, ID and KEYWORD --- with limited success. I have 
> simplified the original grammar a lot, so that it contains only a few lines and 
> is self-contained (it can be debugged in ANTLRWorks). The 
> error/issue/behaviour-producing input is also attached (XXXTEST.abap).
> 
> Here is the problem, as I see it:
> 
> program
>   :  statements  EOF
>   ;
> 
> statements
>   :  ( statement PERIOD )*
>   ;
>   // here the parser sees a KEYWORD token, and wants to further disambiguate by looking ahead
> statement
>     // somehow, on input KEYWORD ID KEYWORD ID, the parser is attracted into this alternative,
>     // because it only considers the lookahead KEYWORD ID KEYWORD ...
>   : kw_stmt2^  id kw_abc
>     // ... although it should take this path, on input KEYWORD ID KEYWORD ID PERIOD:
>   | kw_stmt1^  rule
>   | kw_stmt3^  id
>   ;
> 
> rule
>   : id  xyz*
>   ;
> 
> xyz      : kw_xyz^ id
>   ;
> 
> id  : ID
>   | KEYWORD
>   ;
> 
> kw_stmt1: {getKeywordType( input.LT(1) ) == KW_STMT1}? KEYWORD -> KW_STMT1;
> ....
> 
> Now, one could advise switching the semantic predicate into a gated semantic 
> predicate:
> 
> kw_stmt1: {getKeywordType( input.LT(1) ) == KW_STMT1}?=> KEYWORD -> KW_STMT1;
> 
> This in fact solves the above problem. However, when I use the 
> gated-semantic-predicate approach in my original grammar, with 800 of these 
> keyword-distinguishing rules (note: rules, not tokens!), it again leads to 
> timeouts in exactly the same rule as with the original 800 individual keyword 
> tokens. It must be due to the size of some automaton, which seems to grow not 
> with the number of tokens, but with the number of rules with attached gated 
> semantic predicates.
> 
> I'm sorry that I currently don't have a small, or at least synthetic, grammar 
> that also demonstrates my timeout issue, but I think the above behaviour is 
> interesting in itself, or (so to say) not correct, regardless of the problem I 
> am trying to solve. My opinion is that ANTLR should somehow use a larger 
> lookahead in the situation above. I would really like to dive into the ANTLR 
> source code and check for myself, but am somewhat overwhelmed by its 
> complexity ... it seems that LL(*) is built on a simple idea, but implementing 
> it with all its corner cases takes a lot of effort.
> 
> Best Regards,
> Andreas Meyer
