[antlr-interest] Lexer and Java keywords

Wed Dec 9 08:43:57 PST 2009

Is your IDENTIFIER lexer rule currently located before ABSTRACT and the other keyword rules (as in, at a lower line number in the grammar file)? I'm guessing that's the cause of your current problem. Also, don't specify a value for k in your lexer.

On a side note, the approach below really isn't the ANTLR way to do things, but your generated code will be smaller and faster if you use it: match every keyword as an IDENTIFIER, then look the matched text up in a table and change the token type. I might have the syntax slightly wrong since I'm not a Java programmer. If you are using a combined grammar (lexer and parser in the same file), one downside is that you always have to write ABSTRACT in the parser rules, whereas normally the literal 'abstract' would alias itself to the token.

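@lexer::header
{
import java.util.Hashtable;
}

@lexer::members
{
// Maps keyword text to its token type; consulted by the IDENTIFIER rule.
Hashtable<String, Integer> keywords = new Hashtable<String, Integer>()
    {{
    put("abstract", ABSTRACT);
    put("break", BREAK);
    }};
}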

// These empty fragment rules never match input; they only define the token types that you can use in the parser.
fragment ABSTRACT : ;
fragment BREAK : ;

IDENTIFIER
@after
{
Integer value = keywords.get($text);
if (value != null)
    state.type = value; // assumes the ANTLR 3 Java target, where the token type is a field on the lexer's shared state
}
    : ...
    ;
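
To show the idea outside of ANTLR, here is a minimal plain-Java sketch of the same table lookup. The class name, the classify method, and the token type numbers are made up for illustration (a generated lexer assigns its own values); it only demonstrates "match as an identifier first, then retype if the text is a keyword".

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordLookupSketch {
    // Hypothetical token type numbers; a generated ANTLR lexer assigns its own.
    static final int IDENTIFIER = 4;
    static final int ABSTRACT = 5;
    static final int BREAK = 6;

    // Keyword text mapped to token type, like the keywords table in the grammar.
    private static final Map<String, Integer> KEYWORDS = new HashMap<String, Integer>();
    static {
        KEYWORDS.put("abstract", ABSTRACT);
        KEYWORDS.put("break", BREAK);
    }

    // The text has already been matched as an identifier; retype it only
    // when the whole text is exactly a keyword.
    static int classify(String text) {
        Integer type = KEYWORDS.get(text);
        return type != null ? type : IDENTIFIER;
    }

    public static void main(String[] args) {
        System.out.println(classify("abstract") == ABSTRACT);      // keyword
        System.out.println(classify("abstraction") == IDENTIFIER); // only a prefix is a keyword
    }
}
```

Because the lookup uses the complete matched text, an identifier such as "abstraction" that merely starts with a keyword keeps the IDENTIFIER type.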

Sam

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Marcin Rzeznicki
Sent: Wednesday, December 09, 2009 10:27 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Lexer and Java keywords

Greetings to all,
I've started to play with ANTLR 3.2 after a long break (last time I was
using ANTLR, it was v2). I've been playing with Java grammar, adapting
it to my own needs. I am currently stuck with something that I think
is (or should be) very simple to achieve, yet I somehow cannot make
any progress. By the way, I am using ANTLRWorks 1.3.1.
Let's consider the part of Java lexer grammar that deals with keywords:

ABSTRACT
    : 'abstract'
    ;

ASSERT
     : 'assert'
     ;

BOOLEAN
    : 'boolean'
    ;
...

IDENTIFIER
    : JavaLetter (JavaLetterOrDigit)*
    ;

When I check the grammar in ANTLRWorks it gives me the obvious warnings
that every specified keyword is also a valid identifier:

Multiple token rules can match input such as "'l'": LONG, IDENTIFIER
As a result, token(s) IDENTIFIER were disabled for that input

I expected this and that's fine, but I am also getting errors:
The following token definitions can never be matched because prior
tokens match the same input: ASSERT,BREAK ...

And that's my problem; I am very surprised that this is the case.
Let's take ASSERT - I know that ANTLR complains that upon seeing 'a'
it cannot decide whether token ABSTRACT or ASSERT is to be produced,
so it gives precedence to ABSTRACT and ASSERT is eliminated
completely. But according to Mr Parr's book the default lookahead, if
the 'k' option is not specified, is assumed to be *. So, while I am aware
that this is clearly not LL(1), the default lookahead should solve the
problem. OK, I did not give up and explicitly specified a global k=2.
The warnings and errors are the same:

Multiple token rules can match input such as "'a'": ABSTRACT, ASSERT, IDENTIFIER
As a result, token(s) ASSERT,IDENTIFIER were disabled for that input.

Why? With 2 characters of lookahead the keywords 'abstract' and 'assert'
should be easily distinguishable, right?
So, my question is: what am I doing wrong? How should I handle Java keywords in the lexer?
-- 
Greetings
Marcin Rzeźnicki
