[antlr-interest] Lexer rules and unreachable alternatives (trying to understand lexer)

Thu Apr 19 07:25:04 PDT 2007

On 19/4/2007, at 14:38, Johannes Luber wrote:

> With this rule WS should get all consecutive spaces. But I haven't
> tested if FOO is still chosen over WS. Maybe
>
> start
>    : (WS)=> WS
>    |  FOO
>    ;
>
> is still needed.

I think the problem is that by the time the start rule is run in the  
parser, lexing has already taken place, so by then it is too late for  
the predicate to influence the outcome (you already have either a WS  
or a FOO token).

I did some more testing, and these are the results; for start rules  
like this:

	start : WS | FOO ; // order of WS and FOO in parser rule irrelevant
	start : (WS | FOO)+ ;
	start : .+ ;

If your input is *only* spaces then, all else being equal, the
first-listed lexer rule wins.

But if your input contains more than just spaces (like "foo bar",
"foo   ", "    bar"), FOO is always going to win, regardless of
the order of the lexer rules.

As you commented, the only way to overcome this greedy matching  
behaviour seems to be to explicitly disallow spaces in FOO. No big  
deal, but my natural inclination was to specify my lexer rules like  
this:

SPECIFIC_RULE : ....
LESS_SPECIFIC_RULE : ...
GENERAL_RULE : ...

And let "lexer precedence" sort out which one matches. This doesn't
work, though, because if a more general rule subsumes a more
specific one, then the general rule will always win (a single greedy
match) instead of yielding two smaller matches. In the end it looks
like predicates in the lexer rules or some other workaround will have
to step in.
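
For reference, here is a minimal sketch of the "disallow spaces in FOO"
workaround. The actual definition of FOO isn't repeated in this message,
so the rule below is only an assumption about its general shape (a
catch-all rule that would otherwise swallow spaces), and the grammar
name is made up:

```
grammar Sketch;   // hypothetical grammar name

start : (WS | FOO)+ EOF ;

// WS matches a run of one or more spaces.
WS  : ' '+ ;

// Assumed shape of FOO: one or more characters, none of them a space.
// Because FOO can no longer consume a space, it cannot subsume WS.
FOO : (~' ')+ ;
```

With the two rules made disjoint like this, input such as "foo bar"
should come out as FOO, WS, FOO rather than as a single FOO token, and
the order in which the lexer rules are listed should no longer matter
for these inputs.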

> And as you are new with ANTLR I can recommend the following tutorial
> (which I incidentally wrote):
>
> http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required

Yes, I had already read it, actually. It is a nice introduction to  
the topic! The main thing which I'm having trouble coming to grips  
with is achieving total separation between the lexer and the parser;  
my previous experience was with integrated lexer/parsers, so the  
lexer always knew exactly where it was and what kinds of symbols to  
look for in the current context. But in ANTLR the lexer has to do its  
scanning from start to finish without any help from the parser; I  
understand that you can get it to do what you want using predicates,  
but it's probably going to take me a while to get the hang of it.

Cheers,
Wincent