[antlr-interest] Lexer and Java keywords

Wed Dec 9 10:26:26 PST 2009

Sam,

Your suggestion will generate smaller code, but if you change the default values:

  -Xmaxswitchcaselabels m don't generate switch() statements for dfas bigger  than m [300]
  -Xminswitchalts m       don't generate switch() statements for dfas smaller than m [3]

So that you use:

-Xmaxswitchcaselabels 32000 -Xminswitchalts 1

Then all the code will generate with switches, which are much smaller than the DFA tables. While this may generate more executable code than your hashmap, I think it may be generally a lot faster as there will be no CPU memory cache misses if the compilers are any good. I don't know whether this is true for C# JIT, (perhaps we should try this), I think this is true for Java (Sun Hotspot) (I am going to be trying this) and I know this is very true for C. Basically memory cache misses are the big thing, assuming the algorithms are good.

The C target overrides those defaults automatically, but I did not do this for other targets because I did not have enough information about performance of switches. In theory the hotspot compiler can do a better job than the C compiler because it can readjust the code based upon what values it sees come to the switch most often. However, I have never tried to show if this is the case or not with ANTLR lexers and parsers, there may not be enough hits on any particular switch to make any difference.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Sam Harwell
> Sent: Wednesday, December 09, 2009 8:44 AM
> To: Marcin Rzeznicki; antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Lexer and Java keywords
> 
> Do you currently have the IDENTIFIER lexer rule located before (as in
> line number) ABSTRACT, etc.? I'm guessing that's the cause of your
> current problem. Also, don't specify a value for k in your lexer.
> 
> On a side note, this really isn't the ANTLR way to do things, but your
> generated code will be smaller and faster if you do this. I might have
> the syntax slightly wrong since I'm not a Java programmer. If you are
> using a combined grammar (lexer and parser in the same file), a
> downside of doing this is you have to always use ABSTRACT in the parser
> rules, where normally 'abstract' would alias itself to the token.
> 
> @lexer
> {
> Hashtable<String, Integer> keywords = new Hashtable()
>     {{
>     put("abstract", ABSTRACT);
>     put("break", BREAK);
>     }};
> }
> 
> // the fragment rules assign values to the token types that you can use
> in the parser.
> fragment ABSTRACT : ;
> fragment BREAK : ;
> 
> IDENTIFIER
> @after
> {
> Integer value = keywords.get($text);
> if (value != null)
>     setType(value); // might be state.setType
> }
>     : ...
>     ;
> 
> Sam
> 
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Marcin Rzeznicki
> Sent: Wednesday, December 09, 2009 10:27 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Lexer and Java keywords
> 
> Greetings to all,
> I've started to play with ANTLR 3.2 after long break (last time I was
> using ANTLR, it was v2). I've been playing with Java grammar, adapting
> it to my own needs. I am currently stuck with something that I think
> is (or should be) very simple to achieve, yet I somehow cannot make
> any progress. By the way, I am using ANTLRWorks 1.3.1.
> Let's consider the part of Java lexer grammar that deals with keywords:
> 
> ABSTRACT
>     : 'abstract'
>     ;
> 
> ASSERT
>      : 'assert'
>      ;
> 
> BOOLEAN
>     : 'boolean'
>     ;
> ...
> 
> IDENTIFIER
>     : JavaLetter (JavaLetterOrDigit)*
>     ;
> 
> 
> When I check the grammar in ANTLRWorks it gives me obvious warnings
> that every specified keyword is also a valid identifier :
> 
> Multiple token rules can match input such as "'l'": LONG, IDENTIFIER
> As a result, token(s) IDENTIFIER were disabled for that input
> 
> I expected this and that's fine, but I am also getting errors:
> The following token definitions can never be matched because prior
> tokens match the same input: ASSERT,BREAK ...
> 
> And that's my problem, I am very surprised that this is the case.
> Let's take ASSERT - I know that ANTLR complains that upon seeing 'a'
> it cannot decide whether token ABSTRACT or ASSERT is to be produced,
> so it gives precendence to ABSTRACT so ASSERT is eliminated out
> completely. But according to Mr Parr's book the default lookahead, if
> 'k' options is not specified, is assumed to be *. So, while I am aware
> that this is clearly not LL(1), the default lookahead should solve the
> problem. Ok, I am not giving up and specify explicitly global k=2.
> Warnings and errors are the same :
> 
> Multiple token rules can match input such as "'a'": ABSTRACT, ASSERT,
> IDENTIFIER
> As a result, token(s) ASSERT,IDENTIFIER were disabled for that input.
> 
> Why? With 2 characters LA keywords 'abstract' and 'assert' should be
> easily distinguishable, right?
> So, my question is: what am I doing wrong? How to handle java keywords
> in lexer?
> --
> Greetings
> Marcin Rzeźnicki
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address