[antlr-interest] Question about lexer/parser boundaries

Mon Jun 4 13:47:50 PDT 2007

Jim - thanks for the quick response. I would note a couple of things: 
first, "merging" the tokens at the lexer stage seems to be an 
effective and indeed necessary technique to accommodate the "grouping" 
notation in the XQuery 1.0 grammar. That is, some parser rule 
might contain (to use my prior example for continuity):

... < TOKEN1 TOKEN2 > ...

in the XQuery grammar, denoting that TOKEN1 and TOKEN2 are to be 
effectively treated as one unit. I think this is done in order to 
keep the grammar LL(1)-parsable. ANTLR itself doesn't (unless 
I'm missing it) have such a construct (and subrule grouping in 
parentheses is apparently not equivalent), so the only option I see 
is to define another lexer rule, as in my example:

MULTIPLE: TOKEN1 TOKEN2;

and then up in the parser rules, < TOKEN1 TOKEN2 > can be replaced 
with MULTIPLE. This appears to work as expected. (Concrete examples 
are 'DECLARE boundary-space' vs. 'DECLARE default' vs. 'DECLARE 
namespace' etc. Unless you lex each one as a single unit, the parser 
needs LL(2) to distinguish between them. Correct me if I'm wrong 
here. Yes, I understand that ANTLR 3.0 is LL(*) and can backtrack, but 
I want to keep this LL(1), as intended by the official grammar.)
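
To make the lookahead argument concrete, here is a minimal sketch in 
Python rather than ANTLR. It is only an illustration of the idea, not 
a real XQuery lexer: the token names (DECLARE_BOUNDARY_SPACE etc.) and 
the helpers lex() and declaration_kind() are hypothetical names I made 
up for this example, and the word-splitting regex is a simplifying 
assumption. The lexer folds 'declare' plus the following keyword into 
one token, so the parser can pick the alternative from a single token.

```python
import re

# Hypothetical merged-token names; illustrative only, not from the XQuery spec.
MERGED = {
    "boundary-space": "DECLARE_BOUNDARY_SPACE",
    "default": "DECLARE_DEFAULT",
    "namespace": "DECLARE_NAMESPACE",
}

# Simplifying assumption: a "word" is letters and hyphens; anything else
# that is not whitespace becomes a one-character word.
WORD = re.compile(r"[A-Za-z][A-Za-z-]*|\S")


def lex(text):
    """Return (kind, text) tokens, merging 'declare X' into a single token."""
    words = WORD.findall(text)
    tokens = []
    i = 0
    while i < len(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if words[i] == "declare" and nxt in MERGED:
            # The two-word lookahead happens here, in the lexer.
            tokens.append((MERGED[nxt], "declare " + nxt))
            i += 2
        else:
            tokens.append(("WORD", words[i]))
            i += 1
    return tokens


def declaration_kind(tokens):
    """Pick the declaration alternative by inspecting exactly one token."""
    kind, _ = tokens[0]
    if kind in MERGED.values():
        return kind
    raise SyntaxError("expected a merged 'declare ...' token, got %r" % (kind,))


if __name__ == "__main__":
    print(declaration_kind(lex("declare default element namespace 'x';")))
```

Note that the lookahead has not disappeared in this sketch; it has 
moved from the parser into the lexer, which is what lets the parser 
side stay at one token of lookahead.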

I'm actually more concerned about my first examples, the ones with 
the character ranges, than about the "merging" idea, though I wanted 
to include the latter in my question for completeness.