[antlr-interest] Lexer ambiguities

Mon Feb 11 06:03:39 PST 2008

On Feb 11, 2008 6:05 AM, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 11:00 11/02/2008, Mark Volkmann wrote:
>  >>   a : NUMBER UNIT ;
>  >>   b : VALUE NAME ;
>  >>
>  >>   NUMBER : ('0'..'9')+ ;
>  >>   UNIT : 'kg'  | 'lb' ;
>  >>
>  >>   VALUE : '0' | '1' ;
>  >>   NAME : ('!'..'~')+ ;
>  >>
>  >> How can I distinguish between a NUMBER and a VALUE and between a
>  >> UNIT and a NAME?
>  >
>  >I believe the key is that the order of lexer rules is significant.
>
> That's true, but...
>
>  >You need to put the VALUE rule before the NUMBER rule
>  >and the UNIT rule before the NAME rule
>
> That's not.
>
> The trouble here is that you're both thinking (or at least that's
> what it sounds like) that the parser is choosing the lexer rules
> it wants to look at, which is not the case.

I wasn't thinking that, but I was confused about how the lexer decides
which lexer rule to apply next.

> Lexing happens as a completely independent first step; the
> character stream is scanned and any non-fragment lexer rules are
> considered as possible candidates for generated tokens.  Of those,
> generally speaking the token match that consumes the most input
> "wins",

Ah, I didn't realize that. Why did you say "generally"? Do you know of
some exceptions to this?

> but failing that the first listed rule wins.

Why would it fail? Is it only because multiple lexer rules might match
the same number of characters?

> And all of
> this happens before a single parser rule is evaluated.
>
> So in the example above, swapping the rules will work for input
> like "1 bob" and "24 kg", but will fail on "1 kg", since that's
> VALUE UNIT and that doesn't match any of the parser rules.
>
> Two options:
>
> 1. remove the VALUE rule entirely (changing rule "b" to use a
> NUMBER as well) and either add a validation predicate to check the
> range of number entered is valid within the grammar or leave that
> to semantic checks outside the grammar.
>
> 2. change rule "a" to accept both NUMBERs and VALUEs.  (And swap
> them as Mark suggested.)

I tried option #2. My grammar is below, but it doesn't work with the
following input.

1Mark
19kg

Any idea why?

grammar NumberValue;

file: (line terminator)*;
line: a | b;
a: (VALUE | NUMBER) UNIT;
b: VALUE NAME;

VALUE: '0' | '1';
NUMBER: '0'..'9'+;

UNIT: 'kg' | 'lb';
NAME: '!'..'~'+;

terminator: NEWLINE | EOF;
NEWLINE: ('\r'? '\n')+;
WHITESPACE: (' ' | '\t')+ { $channel = HIDDEN; };
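
If the selection rule quoted above is accurate (the longest match wins, and on a tie the first listed rule wins), a rough Python sketch of it points at one possible cause. This is only a model of that description, not how ANTLR's generated lexer is implemented. The regexes are hand translations of the lexer rules in this grammar, and the names RULES and tokenize are made up for illustration.

```python
import re

# Hand-translated regexes for the lexer rules, in the order they are listed
# in the grammar above. An approximation for illustration only.
RULES = [
    ("VALUE", re.compile(r"[01]")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("UNIT", re.compile(r"kg|lb")),
    ("NAME", re.compile(r"[!-~]+")),
    ("NEWLINE", re.compile(r"(?:\r?\n)+")),
    ("WHITESPACE", re.compile(r"[ \t]+")),
]


def tokenize(text):
    """Longest match wins; on a tie, the first listed rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when match lengths are equal.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WHITESPACE":  # stands in for the HIDDEN channel
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    for line in ["1Mark", "19kg", "1 kg", "24 kg", "1 bob"]:
        print(repr(line), tokenize(line))
```

Under this model "1Mark" and "19kg" each come out as a single NAME token, because '!'..'~' also covers digits and letters and so NAME is the longest match. Neither a nor b accepts a lone NAME. With a space in between ("1 kg", "24 kg") the model gives VALUE UNIT and NUMBER UNIT, which rule a does accept.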

-- 
R. Mark Volkmann
Object Computing, Inc.