[antlr-interest] lookahead DFA too big?

Thomas Brandon tbrandonau at gmail.com
Thu Mar 5 04:47:14 PST 2009


On Thu, Mar 5, 2009 at 9:50 PM, Andreas Meyer
<andreas.meyer at smartshift.de> wrote:
> Maybe it's possible to partition the set of keywords, but that would be
> some effort: figuring out for 800 keywords, where they appear, what is
> the context they are used in etc. Note that the problem only appeared
> after switching to ANTLR 3.1, ANTLR 2.7 was fine with it and the
> generated parser works well.
Presumably in ANTLR 2.7 you used the literals table. ANTLR 3 handles
the keyword/identifier distinction through its nifty DFAs;
unfortunately, with such a large number of keywords the DFA needed gets
pretty complicated, especially if you have a number of other rules
that allow subsets of your keyword vocabulary.
You can duplicate the 2.7 behaviour by having an action in your
identifier rule look the matched text up in a hashtable. Something like:
tokens {
  KEYWORDA;
  KEYWORDB;
}

@lexer::members {
  // Maps keyword text to its token type; anything not in the table is an ID.
  private Hashtable<String,Integer> literalsTable = new Hashtable<String,Integer>() {{
    put("keyworda", KEYWORDA);
    put("keywordb", KEYWORDB);
  }};

  private int checkLiterals(String text) {
    Integer type = literalsTable.get(text);
    if (type != null)
      return type;
    else
      return ID;
  }
}
ID: ('a'..'z')+ { $type = checkLiterals($text); };
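The same lookup can be tried in plain Java, independent of any generated lexer. This is just an illustration; the token-type constants below are made-up values standing in for the ones the grammar's tokens section would define:

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the literals-table idea; in a real lexer the
// token-type constants come from the generated code, not these demo values.
class LiteralsDemo {
    static final int ID = 4;
    static final int KEYWORDA = 5;
    static final int KEYWORDB = 6;

    private static final Map<String, Integer> LITERALS = new HashMap<>();
    static {
        LITERALS.put("keyworda", KEYWORDA);
        LITERALS.put("keywordb", KEYWORDB);
    }

    // One hash lookup decides keyword vs identifier, no matter how many
    // keywords the table holds.
    static int checkLiterals(String text) {
        Integer type = LITERALS.get(text);
        return type != null ? type : ID;
    }
}
```

The point is that keyword recognition becomes a single O(1) lookup, so lexing cost doesn't grow with the size of the keyword vocabulary the way a big DFA or 800 predicates would.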

Though there was a bug that caused warnings when the tokens section was
used like that, so you may need to instead define the token types with
fragment rules (their content is irrelevant, though I don't think
they can be empty).
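If you do hit that warning, a sketch of the fragment-rule workaround might look like this (the rule bodies are placeholders, per the note above that they probably can't be empty):

```antlr
// Instead of listing KEYWORDA/KEYWORDB in the tokens section, define them
// as fragment rules; they are never matched on their own, but they still
// introduce the token types that checkLiterals() returns.
fragment KEYWORDA : 'keyworda' ;
fragment KEYWORDB : 'keywordb' ;
```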

>
> Like advertised in the ANTLR book, I also used semantic predicates to
> locally check if an identifier is actually the keyword I want. This did
> not work out: the language is very verbose and most sentences look like
> native english, so the parser then sees : ID ID ID ID and it has to
> check 800 semantic predicates in order to find out which keyword comes
> next. Having only one token (ID) might have been to extreme, and having
> different groups seems indeed interesting, although probably tedious.
>
Ouch, it seems surprising that the language is so complex that it
actually allows a major subset of your 800 keywords at any one point.
Generally, context should vastly reduce the number of checks needed.

Tom.
> On some power-point presentation from Terence Parr, I have seen a slide
> mentioning huge generated DFAs. Is there some additional material
> available that documents some of the possible situations that make the
> DFAs explode?
>
> Andreas
>
> Loring Craymer schrieb:
>> 800 token types is a staggeringly large number and indicates that you took the wrong path in dealing with the keyword versus identifier problem.  In the cases where I have had this many "keywords", they can usually be decomposed into subcategories and that information kept in a symbol table for use with sempreds.  With this many token types, you want to differentiate locally (use sempreds to recognize keywords where they should appear), not globally (all keywords should be recognized as "IDENTIFIER" in the lexer).
>>
>> --Loring
>>
>>
>>
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>
