[antlr-interest] Fwd: Why is this ambiguous?

Fri Jan 5 06:42:30 PST 2007

Thanks Jose,

Jose Ventura wrote:
> Hi Martin,
>  
> You can see with an example why it is ambiguous.
>  
> With the stream "+1" the lexer can produce:
>  
> - IDENTIFIER(+) INT(1) <-- This is possible because the '+' of INT is
>   optional.
> - INT(+1)

Thanks, I mentioned this in my original email.  It's also true that the 
stream "254" is ambiguous:

- INT(254)
- INT(25) INT(4)
- INT(2) INT(54)
- INT(2) INT(5) INT(4)

The reason this isn't considered ambiguous is that the lexer matches the 
longest possible string.

Is the "longest match" rule only used for deciding how much text to assign 
to a single token, and not for choosing between token types?
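
To make the question concrete, here is a minimal sketch (in Python, not ANTLR) of what a pure longest-match rule would do with the two token definitions from the grammar. The function name tokenize and the regular expressions are assumptions made for illustration; they only transcribe the IDENTIFIER and INT rules and say nothing about how ANTLR 2.7.5 actually picks a rule.

```python
import re

# Token patterns transcribed from the grammar in this thread.
TOKEN_SPECS = [
    ("IDENTIFIER", re.compile(r"\+")),
    ("INT", re.compile(r"[+-]?[0-9]+")),
]


def tokenize(text):
    """Split text into (type, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no token matches at position {pos}: {text[pos]!r}")
        name, m = best
        tokens.append((name, m.group()))
        pos = m.end()
    return tokens


print(tokenize("254"))
print(tokenize("+1"))
print(tokenize("+"))
```

Under this rule "254" can only come out as a single INT and "+1" as a single INT, while a lone "+" is an IDENTIFIER, which is the behaviour described in the original email.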

> There are two solutions.
>  
>  
> Maybe you can try:
>  
> INT_IDENTIFIER
>     : '+' {$setType(IDENTIFIER);}
>       ( ('0'..'9')+ {$setType(INT);}
>       | ('a'..'z')*
>       )
>     ;
>  
> INT: ('-')? ('0'..'9')+ ;

Thanks, perhaps I'll give that a go.
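
For reference, the decision that the suggested INT_IDENTIFIER rule makes can be sketched as follows. This is a minimal Python illustration under stated assumptions, not generated ANTLR code: match_int_identifier is a hypothetical helper, and the returned type strings stand in for the $setType calls in the rule.

```python
def is_digit(c):
    return "0" <= c <= "9"


def is_lower(c):
    return "a" <= c <= "z"


def match_int_identifier(text, pos=0):
    """Mimic the suggested rule: consume '+', then digits (INT) or letters (IDENTIFIER)."""
    if text[pos:pos + 1] != "+":
        raise ValueError(f"expected '+' at position {pos}")
    end = pos + 1
    token_type = "IDENTIFIER"  # default, like the first $setType in the rule
    if end < len(text) and is_digit(text[end]):
        # ('0'..'9')+ branch: the token becomes an INT.
        while end < len(text) and is_digit(text[end]):
            end += 1
        token_type = "INT"
    else:
        # ('a'..'z')* branch: zero or more lowercase letters.
        while end < len(text) and is_lower(text[end]):
            end += 1
    return token_type, text[pos:end]


print(match_int_identifier("+1"))
print(match_int_identifier("+"))
print(match_int_identifier("+abc"))
```

Because only one rule starts with '+', the character after the '+' is enough to choose the branch, so the two token types no longer compete for the same prefix.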

- Martin

>  
> I think this runs OK, but you should check it.
>  
> Regards,
> José Ventura
>    
> ---------- Forwarded message ----------
> From: *Martin C. Martin* <martin at martincmartin.com 
> <mailto:martin at martincmartin.com>>
> Date: 05-ene-2007 2:24
> Subject: [antlr-interest] Why is this ambiguous?
> To: antlr-interest at antlr.org <mailto:antlr-interest at antlr.org>
> 
> Hi,
> 
> First of all, thanks for Antlr, it's a huge help!
> 
> But I don't understand why the following dead-simple lexer is ambiguous:
> 
> class MyLexer extends Lexer;
> 
> options {
>    k=4;
> }
> 
> IDENTIFIER: "+" ;
> 
> INT : ('+' | '-')? ( '0'..'9' )+ ;
> 
> An INT must contain at least one digit, and an IDENTIFIER no digits.  So
> if I receive a + followed by any non-digit (including end of stream), it
> must be an identifier.  If I get a + followed by a digit, it must be an
> INT.  It can't be an IDENTIFIER followed by an INT, because when
> deciding what token to use for the +, it must match the longest
> sequence, and + followed by digits is longer than just + alone.
> 
> Am I missing something?  How do I make this non-ambiguous?  For the
> record, the error message is:
> 
> $ java antlr.Tool MyLexer.g
> ANTLR Parser Generator   Version 2.7.5 (20050128)   1989-2005 jGuru.com
> MyLexer.g: warning:lexical nondeterminism between rules IDENTIFIER and INT upon
> MyLexer.g:     k==1:'+'
> MyLexer.g:     k==2:<end-of-token>
> MyLexer.g:     k==3:<end-of-token>
> MyLexer.g:     k==4:<end-of-token>
> 
> Best,
> Martin
> 
>