[antlr-interest] ANTLR Problem When a Token Name is a Prefix of Another Token Name

Mon May 17 20:31:57 PDT 2010

On May 17, 2010, at 5:00 PM, Sameh W. Zaky wrote:

> Hey all,
> 
> In the following simple grammar:
> 
> start : ANIMAL ('or' ANIMAL)* 'and' SERVICE EOF ;
> 
> ANIMAL : ('dog' | 'cat' | 'horse') ;
> SERVICE : ('dog hardware' | 'software') ;
> 
> NOTICE: 'dog' is a proper prefix of 'dog hardware'..
> ======================================================
> 
> *When I run this grammar by giving an input sentence, something goes wrong
> whenever I use the token 'dog'..*
> "dog and software" --> "dog and" disappears in the input box, and also in
> the tree
> "dog or cat or software"  --> "dog or" disappears in the input box, and also
> in the tree
> "cat or dog and software" --> "dog and" disappears..
> 
> *While there is no problem with the token 'dog hardware'*
> "cat and dog hardware" --> works fine..
> 
> I know the reason.. It's because the grammar is confused when one token is a
> proper prefix of another token.. So the token with the bigger length works
> fine while the other one doesn't..
> 
> Any solution to this problem? (Other than changing the name of the token
> because in my real grammar I really need the token names to stay as they
> are)
<snip>

Sameh:

Resolving this problem requires lookahead of a kind more commonly used in syntactic analysis than in lexical analysis. Is there really a good reason why phrases such as 'dog hardware' have to be lexical elements (tokens) and not syntactic elements (productions)? What will you do if the input has two spaces between the words in your token, e.g. between 'dog' and 'hardware'?

For example, why isn't the following acceptable?

start : animal ('or' animal)* 'and' service EOF ;

animal : (DOG | CAT | HORSE) ;
DOG : 'dog';
CAT : 'cat';
HORSE : 'horse';
service : (DOG HARDWARE | SOFTWARE) ;
HARDWARE : 'hardware';
SOFTWARE : 'software';
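
To make the idea concrete, here is a rough hand-written sketch in Python (not ANTLR output) of how the grammar above behaves once 'dog' and 'hardware' are separate tokens. It assumes whitespace is skipped between tokens, which the grammar above does not spell out, and that 'or' and 'and' become their own tokens. The names KEYWORDS, tokenize and parse are made up for this illustration.

```python
import re

# One token per word; whitespace separates tokens and is never part of one.
# (Assumption: the real grammar would skip whitespace in a similar way.)
KEYWORDS = {
    "dog": "DOG", "cat": "CAT", "horse": "HORSE",
    "hardware": "HARDWARE", "software": "SOFTWARE",
    "or": "OR", "and": "AND",
}

def tokenize(text):
    """Split on any run of whitespace and map each word to a token type."""
    tokens = []
    for word in re.findall(r"\S+", text):
        if word not in KEYWORDS:
            raise ValueError(f"unexpected word: {word!r}")
        tokens.append(KEYWORDS[word])
    tokens.append("EOF")
    return tokens

def parse(text):
    """start : animal ('or' animal)* 'and' service EOF ;"""
    tokens = tokenize(text)
    pos = 0

    def expect(kind):
        nonlocal pos
        if tokens[pos] != kind:
            raise ValueError(f"expected {kind}, got {tokens[pos]}")
        pos += 1

    def animal():
        # animal : DOG | CAT | HORSE ;
        nonlocal pos
        if tokens[pos] not in ("DOG", "CAT", "HORSE"):
            raise ValueError(f"expected an animal, got {tokens[pos]}")
        name = tokens[pos]
        pos += 1
        return name

    animals = [animal()]
    while tokens[pos] == "OR":
        pos += 1
        animals.append(animal())
    expect("AND")

    # service : DOG HARDWARE | SOFTWARE ;
    # The parser, not the lexer, decides that DOG here starts a service.
    if tokens[pos] == "DOG":
        pos += 1
        expect("HARDWARE")
        service = "dog hardware"
    else:
        expect("SOFTWARE")
        service = "software"
    expect("EOF")
    return animals, service
```

With this split, parse("dog and software") and parse("cat and dog hardware") both succeed, and so does the two-space input "cat and dog  hardware", because the lexer never has to choose between 'dog' and 'dog hardware'.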