[antlr-interest] Difference between tokens {...} and ordinary lexerrules

Fri Feb 29 16:25:43 PST 2008

Hi Felix,

Placing the literal definitions for your tokens in the tokens section has the 
effect of adding productions for those literals to the lexer grammar.  For 
example, given the following combined lexer and parser grammar, ANTLR will 
generate

grammar test;
tokens {
 CLINTON = 'clinton';
 OBAMA = 'obama';
}
stuff: (CLINTON|OBAMA)*;

the file test__.g.  This file is a lexer grammar that contains the token 
list translated into productions:

lexer grammar test;

CLINTON : 'clinton' ;
OBAMA : 'obama' ;

ANTLR can read this file and generate the lexer for your target.  In a 
combined grammar, then, literals defined in the tokens section and explicit 
lexer productions are equivalent.
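
The mapping from a tokens section to lexer productions is mechanical.  As a 
rough illustration only (this is not ANTLR's own code; the function name 
tokens_to_lexer_grammar is made up for this sketch), it could be written in 
Python like this:

```python
def tokens_to_lexer_grammar(name, tokens):
    """Render (token name, literal) pairs as explicit lexer productions.

    Hypothetical helper: it only mimics the shape of the test__.g file
    shown above, not how ANTLR builds that file internally.
    """
    lines = ["lexer grammar %s;" % name, ""]
    for token_name, literal in tokens:
        # Each "NAME = 'literal';" entry becomes "NAME : 'literal' ;".
        lines.append("%s : '%s' ;" % (token_name, literal))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(tokens_to_lexer_grammar(
        "test", [("CLINTON", "clinton"), ("OBAMA", "obama")]))
```

Each entry in the tokens section turns into exactly one production, which is 
why the two forms behave the same in a combined grammar.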

However, if the grammar is a lexer grammar, ANTLR neither accepts literals 
in the tokens section nor adds equivalent productions for them.  For 
example, the grammar

lexer grammar test;
tokens {
 CLINTON = 'clinton';
 OBAMA;
}
OBAMA: 'obama';

is not accepted.  ANTLR outputs:

ANTLR Parser Generator  Version 3.0.1 (August 13, 2007)  1989-2007
error(108): test.g:3:12: literals are illegal in lexer tokens{} section: 'clinton'

If you try to use only the tokens section to define your lexer 
productions, e.g.,

lexer grammar test;
tokens {
 CLINTON = 'clinton';
 OBAMA = 'obama';
}

ANTLR fails in an entirely different way, ending in an internal error:

ANTLR Parser Generator  Version 3.0.1 (August 13, 2007)  1989-2007
error(100): test.g:6:1: syntax error: antlr: test.g:6:1: unexpected token: null
error(10):  internal error: test.g : java.lang.NullPointerException
org.antlr.tool.Grammar.setGrammarContent(Grammar.java:524)
org.antlr.tool.Grammar.<init>(Grammar.java:456)
org.antlr.Tool.getGrammar(Tool.java:331)
org.antlr.Tool.process(Tool.java:267)
org.antlr.Tool.main(Tool.java:70)

It seems to me the safest thing to do is to place the definitions as 
productions in your grammar.  That also lets you add predicates to the 
productions in order to fine-tune the lexer.  But you probably don't want to 
add literals to your grammar like this anyway if you have a lot of them.  It 
seems that the generated lexer may become too large to compile (e.g., with 
the Java target, where a method is limited to 64K of bytecode).
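
One common way around a large number of literals, which I am substituting 
here as an illustration and which is not something ANTLR does for you, is to 
match every word with a single generic rule and then look the text up in a 
keyword table.  A minimal sketch in Python, where the token type names and the 
tokenize function are assumptions made for the example:

```python
import re

# Hypothetical token types; a generated lexer would use integer constants.
ID, CLINTON, OBAMA = "ID", "CLINTON", "OBAMA"

# One table entry per keyword instead of one lexer production per keyword.
KEYWORDS = {"clinton": CLINTON, "obama": OBAMA}

# The single generic "identifier" pattern (lowercase words only, for brevity).
WORD = re.compile(r"[a-z]+")


def tokenize(text):
    """Return (type, text) pairs for each lowercase word in text.

    Characters that are not part of a lowercase word are skipped.
    """
    tokens = []
    for match in WORD.finditer(text):
        word = match.group()
        # Retype the generic ID token when the word is a known keyword.
        tokens.append((KEYWORDS.get(word, ID), word))
    return tokens


if __name__ == "__main__":
    print(tokenize("clinton obama mccain"))
```

The amount of matching code stays the same however many keywords you add; 
only the table grows.  In an ANTLR grammar the same idea would presumably 
live in an action attached to one identifier production.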

--Ken Domino