[antlr-interest] Lexer rules and unreachable alternatives (trying to understand lexer)

Thu Apr 19 05:38:22 PDT 2007

Wincent Colaiuta wrote:
> Ok, the funny thing is that there are no other rules at all. I made a
> lexer with that single rule in it because I was trying to figure out
> what it did under the covers... Given that no ambiguity is possible with
> only one rule, I wonder if ANTLR has a hard-coded response to lexer
> rules like ".+"...
> 
> The thing which motivated me to start exploring this was a set of
> questions about lexer precedence (by which I mean, how the lexer chooses
> which rules to try) and I had a set of rules which looked something like
> this:
> 
> WS : ' '+ ;
> FOO : ~('x' | 'y' | 'z')+ ;

Did you have a parser rule like

start
   :  WS
   |  FOO
   ;

in your example? Otherwise ANTLR may, on its own, choose the superset
rule (FOO) over the subset rule (WS).
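
To illustrate what I mean by superset and subset, here is a rough Python
sketch. It is only an analogy: the two regular expressions stand in for
the WS and FOO rules quoted above, and longest_match is a hypothetical
helper of mine, not anything ANTLR generates. It assumes the lexer
prefers whichever rule consumes the most input, and that ties go to the
rule listed first.

```python
import re

# Regex stand-ins for the two lexer rules quoted above. This approximates
# the rules themselves, not ANTLR's generated prediction code.
WS = re.compile(r" +")
FOO = re.compile(r"[^xyz]+")


def longest_match(text):
    """Return (rule_name, lexeme) for the rule matching the longest prefix.

    Ties go to the rule listed first (an assumption of this sketch).
    Returns None if neither rule matches at the start of the text.
    """
    best = None
    for name, pattern in (("WS", WS), ("FOO", FOO)):
        m = pattern.match(text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Five spaces plus a newline: FOO consumes all six characters, WS only five.
print(longest_match("     \n"))
```

Every string WS accepts is also accepted by FOO, so as soon as the input
contains one more character that is not x, y or z (such as the trailing
newline), FOO gives the longer match.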

> At first I mistakenly thought that the lexer would try lexer rules in
> order (WS first and then FOO), but it doesn't. It calls a predict method
> and the prediction always goes for FOO without fail. My understanding
> is now that the prediction method favors a greedy match, and so even
> typing "     \n" into the test rig is enough to make it prefer FOO over
> WS (because of the trailing newline). I played around with greedy=false
> but that yielded single characters rather than a string of
> non-whitespace characters. In any case, exploring the issue I eventually
> got down to a minimal lexer containing that lone OTHER rule...

With greedy=false you can't get a token with more than one character
unless you have some kind of stopping character. After all, the WS rule
is satisfied with just one space.

WS: ' '+ ~' ';

With this rule WS should match all consecutive spaces. But I haven't
tested whether FOO is still chosen over WS. Maybe

start
   : (WS)=> WS
   |  FOO
   ;

is still needed.
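
The effect of the stopping character can be sketched with Python regular
expressions. Again this is only an analogy, assuming that a non-greedy
regex quantifier behaves roughly like greedy=false on a lexer loop; the
variable names are mine.

```python
import re

text = "   abc"

# Greedy loop: consumes the whole run of spaces.
greedy = re.match(r" +", text).group()

# Non-greedy loop with nothing after it: satisfied by a single space.
non_greedy = re.match(r" +?", text).group()

# Non-greedy loop followed by a stopping character: the loop has to keep
# going until a non-space character is available to match.
with_stop = re.match(r" +?[^ ]", text).group()

print(repr(greedy), repr(non_greedy), repr(with_stop))
```

Note that in the regex version the stopping character becomes part of
the match ("   a" here). I would expect the same from the WS rule above,
since ~' ' is matched inside the rule, but I haven't checked that in
ANTLR either.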

Best regards,
Johannes Luber