[antlr-interest] Question about lexer/parser boundaries

Mon Jun 4 13:58:36 PDT 2007

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Phil Oliver
> Sent: Monday, June 04, 2007 1:48 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Question about lexer/parser boundaries
> 
> Jim - thanks for the quick response. I would note a couple of things:
> first, "merging" the tokens at the lexer stage seems to be an
> effective and indeed necessary technique to accommodate the "grouping"
> notation in the XQuery 1.0 grammar. i.e. in some parser rule, there
> might be a reference to (to use my prior example for continuity):
> 
> ... < TOKEN1 TOKEN2 > ...
> 
> in the XQuery grammar, denoting that TOKEN1 and TOKEN2 are to be
> effectively treated as one unit.

I think you are confusing two things: the tokens as implied by the
language you are parsing (as in, TOKEN1 TOKEN2 are to be treated as one
unit by the parser that is parsing the query) and the tokenizability of
the input language. They are not the same thing. You would match the
construct above as:

compound_element : LBRACKET TOKEN1 TOKEN2 RBRACKET ;

> I think this is done in order to
> preserve the grammar as LL(1) parsable. ANTLR itself doesn't (unless
> I'm missing it) have such an ability (and sub-rules grouping in
> parentheses are not equivalent apparently), other than to define
> another lexer rule as my example gave:
> 
> MULTIPLE: TOKEN1 TOKEN2;
> 
> and then up in the parser rules, < TOKEN1 TOKEN2 > can be replaced
> with MULTIPLE.  This appears to work as expected. (Concrete examples
> are 'DECLARE boundary-space' vs. 'DECLARE default' vs. 'DECLARE
> namespace' etc. - unless you lex each one as a single unit, the parser
> needs LL(2) to distinguish between them. Correct me if I'm wrong
> here. Yes, I understand that ANTLR 3.0 is LL(*) and can backtrack but
> I want to keep this LL(1), as intended by the official grammar.)

I don't think that the tokenizing will affect that. However, if < and >
are not used ambiguously elsewhere as operators, then you might do it
that way, so long as you use fragment rules and assuming that TOKEN1 and
TOKEN2 are unambiguous tokenized strings. But for the case above, your
parser need only do:

declare_stuff
	: DECLARE
		(
			  DEFAULT { /* default namespace stuff */ }
			| NAMESPACE ...
			// and so on for the other alternatives
		)
	;
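
To illustrate why one token of lookahead is enough for this rule, here is
a minimal Python sketch of the same decision. It is not ANTLR output and
not the official XQuery grammar; the token names BOUNDARY_SPACE, DEFAULT
and NAMESPACE and the helpers tokenize and parse_declare are made up for
the example. Each keyword is lexed as its own token, and once DECLARE has
been matched the parser picks an alternative by looking at only the next
token:

```python
# Hypothetical keyword table: source text -> token type (illustrative names only).
KEYWORDS = {
    "declare": "DECLARE",
    "boundary-space": "BOUNDARY_SPACE",
    "default": "DEFAULT",
    "namespace": "NAMESPACE",
}


def tokenize(text):
    """Split on whitespace; every keyword becomes its own, unmerged token."""
    return [KEYWORDS.get(word, "NAME") for word in text.split()]


def parse_declare(tokens):
    """Match DECLARE, then choose the alternative from ONE token of lookahead."""
    if not tokens or tokens[0] != "DECLARE":
        raise SyntaxError("expected DECLARE")
    # DECLARE has been consumed, so this is LA(1) at the decision point.
    lookahead = tokens[1] if len(tokens) > 1 else "EOF"
    if lookahead == "BOUNDARY_SPACE":
        return "boundary-space declaration"
    if lookahead == "DEFAULT":
        return "default declaration"
    if lookahead == "NAMESPACE":
        return "namespace declaration"
    raise SyntaxError("unexpected token after DECLARE: " + lookahead)


if __name__ == "__main__":
    for source in ("declare boundary-space", "declare default", "declare namespace"):
        print(source, "->", parse_declare(tokenize(source)))
```

Because DECLARE is common to every alternative and is matched before the
choice is made, the decision never has to look past the single token that
follows it, so no merged lexer token is needed for this case.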

> 
> I'm actually more concerned about my first examples with the
> character ranges, than the "merging" idea, though for completeness I
> wanted to include it in my question.

Your other examples were fine.

Jim