[antlr-interest] Question about lexer/parser boundaries
Phil Oliver
antlr at olivercomputing.com
Mon Jun 4 13:16:29 PDT 2007
I'm using ANTLR v3.0 to create a full XQuery 1.0 lexer/parser (a
non-trivial task IMO). While I have experience writing compilers
(long ago), this is the first time I've used a formal system such as ANTLR.
My first question relates to an important phrase in the book "The
Definitive ANTLR Reference" (a must buy if you're serious about
working with ANTLR 3.0), page 290:
... You should restrict lexer rules to matching single lexical constructs ...
Personally I found this to be an extremely important piece of
information that ought to be more stressed in this and other
documentation. Existing EBNF grammars, including the official XQuery
grammar, do not tell us which should be lexer and which should be
parser rules; so it's up to the one translating those grammars to
ANTLR syntax to decide which how the rules should be structured
(lexer vs. parser).
So given that context, the question is: what exactly constitutes
"single lexical constructs" in ANTLR's context? I would think that
the definition of DUMMY would effectively amount to a single lexical construct:
DUMMY : 'A' | 'G' | 'Z' | DIGIT;
fragment DIGIT : '0'..'9';
or indeed this:
DUMMY2 : 'A..Z' ~'X';
because the rules boil down to a set of permissible single Unicode
characters (or character ranges).
There is also this, the pattern of which is very useful to "merge"
multiple tokens into a single token for the parser in order to reduce
lookahead 'k':
MULTIPLE : TOKEN1 TOKEN2;
TOKEN1 : 'Test1';
TOKEN2 : 'Test2';
So my question is this: Are DUMMY, DUMMY2, and MULTIPLE permissible
lexer rules, or should they still be defined as parser rules? (then
defined with lower case letters.)
More information about the antlr-interest
mailing list