[antlr-interest] Lexer ambiguities

Mon Feb 11 04:05:41 PST 2008

At 11:00 11/02/2008, Mark Volkmann wrote:
 >>   a : NUMBER UNIT ;
 >>   b : VALUE NAME ;
 >>
 >>   NUMBER : ('0'..'9')+ ;
 >>   UNIT : 'kg'  | 'lb' ;
 >>
 >>   VALUE : '0' | '1' ;
 >>   NAME : ('!'..'~')+ ;
 >>
 >> How can I distinguish between a NUMBER and a VALUE and between a
 >> UNIT and a NAME?
 >
 >I believe the key is that the order of lexer rules is significant.

That's true, but...

 >You need to put the VALUE rule before the NUMBER rule
 >and the UNIT rule before the NAME rule

That part isn't, or at least it isn't enough on its own.

The trouble here is that you both seem to be assuming (or at least 
that's what it sounds like) that the parser chooses which lexer 
rules it wants to look at, which is not the case.

Lexing happens as a completely independent first step: the 
character stream is scanned and every non-fragment lexer rule is 
considered as a candidate for the next token.  Of those, generally 
speaking the match that consumes the most input "wins"; if several 
rules match the same amount of input, the first listed rule wins.  
And all of this happens before a single parser rule is evaluated.

So in the example above, swapping the rules will work for input 
like "1 bob" and "24 kg", but will fail on "1 kg", since that's 
VALUE UNIT and that doesn't match any of the parser rules.
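
The matching behaviour described above can be sketched outside 
ANTLR.  What follows is a minimal Python simulation, not ANTLR's 
actual generated lexer: it assumes plain longest-match with rule 
order as the tie-breaker, skips spaces, and uses hypothetical names 
(RULES, tokenize).  The rules are listed in the swapped order that 
Mark suggested.

```python
import re

# Lexer rules in the swapped order (VALUE before NUMBER, UNIT before NAME).
# Assumption: longest match wins, ties go to the rule listed first.
RULES = [
    ("VALUE", re.compile(r"[01]")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("UNIT", re.compile(r"kg|lb")),
    ("NAME", re.compile(r"[!-~]+")),
]


def tokenize(text):
    """Return the list of token types for text, skipping spaces."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    return tokens


for sample in ("1 bob", "24 kg", "1 kg"):
    print(sample, "->", tokenize(sample))
```

Under these assumptions "1 kg" comes out as VALUE UNIT, which 
neither rule "a" nor rule "b" accepts, and the parser never had a 
say in that decision.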

Two options:

1. remove the VALUE rule entirely (changing rule "b" to use a 
NUMBER as well) and either add a validating predicate to check 
that the number entered is within the range the grammar allows, or 
leave that to semantic checks outside the grammar.

2. change rule "a" to accept both NUMBERs and VALUEs.  (And swap 
them as Mark suggested.)
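
As a rough check of both options, here is a small Python sketch.  
It is a simulation only, not ANTLR output: it assumes a simplified 
longest-match lexer with rule order as the tie-breaker, and the 
names tokenize, OPTION_1, OPTION_2, accepts_option_1 and 
accepts_option_2 are hypothetical.  The range check in option 1 
stands in for the predicate or the external semantic check.

```python
import re


def tokenize(text, rules):
    """Longest match wins; ties go to the earlier rule. Spaces are skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        best = None  # (token type, matched text)
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


# Option 1: no VALUE token; rule "b" takes a NUMBER and validates its text.
OPTION_1 = [("NUMBER", r"[0-9]+"), ("UNIT", r"kg|lb"), ("NAME", r"[!-~]+")]


def accepts_option_1(text):
    toks = tokenize(text, OPTION_1)
    types = [t for t, _ in toks]
    if types == ["NUMBER", "UNIT"]:  # a : NUMBER UNIT
        return "a"
    if types == ["NUMBER", "NAME"] and toks[0][1] in ("0", "1"):
        return "b"  # b : NUMBER NAME, with the range check applied
    return None


# Option 2: keep VALUE (listed first); rule "a" takes NUMBER or VALUE.
OPTION_2 = [("VALUE", r"[01]")] + OPTION_1


def accepts_option_2(text):
    types = [t for t, _ in tokenize(text, OPTION_2)]
    if types in (["NUMBER", "UNIT"], ["VALUE", "UNIT"]):  # a : (NUMBER|VALUE) UNIT
        return "a"
    if types == ["VALUE", "NAME"]:  # b : VALUE NAME
        return "b"
    return None


for sample in ("1 bob", "24 kg", "1 kg", "24 bob"):
    print(sample, accepts_option_1(sample), accepts_option_2(sample))
```

In this model both options accept "1 kg" through rule "a" and 
still reject "24 bob"; they differ only in where the 0-or-1 
restriction is enforced.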
