[antlr-interest] Question about lexer/parser boundaries

Mon Jun 4 13:16:29 PDT 2007

I'm using ANTLR v3.0 to create a full XQuery 1.0 lexer/parser (a 
non-trivial task IMO). While I have experience writing compilers 
(long ago), this is the first time I've used a formal system such as ANTLR.

My first question relates to an important phrase in the book "The 
Definitive ANTLR Reference" (a must buy if you're serious about 
working with ANTLR 3.0), page 290:

... You should restrict lexer rules to matching single lexical constructs ...

Personally, I found this to be an extremely important piece of 
information that ought to be stressed more in this and other 
documentation. Existing EBNF grammars, including the official XQuery 
grammar, do not tell us which rules should be lexer rules and which 
should be parser rules, so it's up to whoever translates those 
grammars to ANTLR syntax to decide how the rules should be structured 
(lexer vs. parser).

So given that context, the question is: what exactly constitutes 
"single lexical constructs" in ANTLR's context? I would think that 
the definition of DUMMY would effectively amount to a single lexical construct:

DUMMY : 'A' | 'G' | 'Z' | DIGIT;
fragment DIGIT : '0'..'9';

or indeed this:

DUMMY2 : 'A'..'Z' ~'X';

because the rules boil down to a set of permissible single Unicode 
characters (or character ranges).

There is also the following pattern, which is very useful for 
"merging" multiple tokens into a single token for the parser in order 
to reduce the lookahead 'k':

MULTIPLE : TOKEN1 TOKEN2;
TOKEN1 : 'Test1';
TOKEN2 : 'Test2';

So my question is this: Are DUMMY, DUMMY2, and MULTIPLE permissible 
lexer rules, or should they instead be defined as parser rules (in 
which case they would be written with lower-case names)?
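
For comparison, here is a rough sketch of how I would expect the 
parser-rule versions of DUMMY and MULTIPLE to look. The lower-case 
names dummy and multiple are only placeholders of mine, and I am 
assuming DIGIT would have to lose its fragment modifier, since a 
fragment rule does not produce a token the parser can see. I have left 
DUMMY2 out because I am not sure how a character range would be 
expressed in a parser rule.

dummy : 'A' | 'G' | 'Z' | DIGIT;
DIGIT : '0'..'9';

multiple : TOKEN1 TOKEN2;
TOKEN1 : 'Test1';
TOKEN2 : 'Test2';

In this form the parser, rather than the lexer, decides that TOKEN1 
followed by TOKEN2 belongs together, so the lexer only ever emits the 
two separate tokens.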