[antlr-interest] Lexer rules and unreachable alternatives (trying to understand lexer)
Johannes Luber
jaluber at gmx.de
Thu Apr 19 05:38:22 PDT 2007
Wincent Colaiuta wrote:
> Ok, the funny thing is that there are no other rules at all. I made a
> lexer with that single rule in it because I was trying to figure out
> what it did under the covers... Given that no ambiguity is possible with
> only one rule, I wonder if ANTLR has a hard-coded response to lexer
> rules like ".+"...
>
> The thing which motivated me to start exploring this was a set of
> questions about lexer precedence (by which I mean, how the lexer chooses
> which rules to try) and I had a set of rules which looked something like
> this:
>
> WS : ' '+ ;
> FOO : ~('x' | 'y' | 'z')+ ;
Did you have a parser rule like
start
: WS
| FOO
;
in your example? Otherwise ANTLR may, on its own, choose the superset
rule (FOO) over the subset rule (WS).
> At first I mistakenly thought that the lexer would try lexer rules in
> order (WS first and then FOO), but it doesn't. It calls a predict method
> and the prediction always goes for FOO without fail. My understanding
> is now that the prediction method favors a greedy match, and so even
> typing " \n" into the test rig is enough to make it prefer FOO over
> WS (because of the trailing newline). I played around with greedy=false
> but that yielded single characters rather than a string of
> non-whitespace characters. In any case, exploring the issue I eventually
> got down to a minimal lexer containing that lone OTHER rule...
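As a rough illustration of the greedy preference described above, here is a
small Python sketch. It is only an analogy: the regexes stand in for the two
lexer rules, and the helper longest_match is a hypothetical name, not anything
ANTLR generates. ANTLR's real prediction works on a DFA, so treat this as a
model of the observed behaviour, not of the implementation.

```python
import re

# Regex stand-ins for the two lexer rules (an analogy, not ANTLR's DFA prediction).
RULES = [
    ("WS", re.compile(r" +")),         # WS  : ' '+ ;
    ("FOO", re.compile(r"[^xyz]+")),   # FOO : ~('x' | 'y' | 'z')+ ;
]

def longest_match(text, pos=0):
    """Return (rule_name, lexeme) for the rule matching the most characters at pos.

    Ties go to the rule listed first in RULES (an assumption of this sketch).
    Returns None if no rule matches.
    """
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# A space followed by a newline: FOO covers both characters, WS only one.
print(longest_match(" \n"))
```

Under this model the trailing newline is what tips the choice to FOO, since
the newline is not 'x', 'y' or 'z' and so extends the FOO match by one
character.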
With greedy=false you can't get a token with more than one character
unless you have some kind of stopping character. After all, the WS rule
is satisfied with just one space.
WS: ' '+ ~' ';
With this rule WS should get all consecutive spaces. But I haven't
tested whether FOO is still chosen over WS. Maybe
start
: (WS)=> WS
| FOO
;
is still needed.
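The effect of greedy=false and of a stopping character can be sketched with
Python regexes. Again this is only an analogy, assuming that a lazy quantifier
behaves roughly like a non-greedy ANTLR loop; the variable names are made up
for the example and nothing here has been checked against ANTLR itself.

```python
import re

text = "   a"  # three spaces followed by a non-space character

# Greedy loop: consumes every consecutive space.
greedy = re.match(r" +", text).group()

# Non-greedy loop with nothing after it: satisfied by a single space.
lazy = re.match(r" +?", text).group()

# Non-greedy loop followed by a stopping character: it has to keep
# consuming spaces until the non-space character can match.
lazy_with_stop = re.match(r" +?[^ ]", text).group()

print(repr(greedy), repr(lazy), repr(lazy_with_stop))
```

Note that in this regex analogy the stopping character becomes part of the
match, which is worth checking for in the token text if the same pattern is
used in a lexer rule.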
Best regards,
Johannes Luber