[antlr-interest] Bug with number of Tokens in lexers? (was: XML QName Character Validation)

Mon Apr 7 01:44:24 PDT 2008

Hi,

> For NCName, I suggest you look only at the first character, then
> accept anything which is not a delimiter (e.g. ":", space, angle
> bracket, etc.). After the match, call a routine to check that the
> match is a valid name. This has two advantages:

The trouble is that only looking at the first character doesn't really  
help - I'm already in trouble with the first decision.
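
For reference, the post-match check suggested in the quoted advice might
look roughly like the Java sketch below. The class and method names are
hypothetical, and the character ranges are a simplified ASCII-only
stand-in for the full XML NCName productions, so treat it as an
illustration of the idea rather than code from my grammar.

```java
// Hypothetical helper: validates a lexer match after the fact.
// Assumption: ASCII-only ranges; the real NCName rules allow many
// more Unicode letters, digits, combining characters and extenders.
class NCNameCheck {
    // Characters allowed at the start of a name (simplified).
    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    // Characters allowed after the first one (simplified).
    // Note that ':' is deliberately excluded, since an NCName has no colon.
    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    // Returns true if the whole matched text is a valid (simplified) NCName.
    static boolean isValidNCName(String text) {
        if (text.isEmpty() || !isNameStart(text.charAt(0))) {
            return false;
        }
        for (int i = 1; i < text.length(); i++) {
            if (!isNameChar(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Such a routine would be called from the lexer rule's action once the
token text is known, reporting an error if it returns false.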

I think I'm running into either an undocumented (= unknown to me ;-))
limitation of ANTLR's lexer generation, or a bug.

The attached lexer grammar is the lexical part of my XQuery grammar.  
I've commented out most of the tokens section, see the block comment  
starting at line 37.

The weird thing is that with only the tokens up to line 36 enabled,
everything works as expected. If I uncomment one more token (e.g.
include the EVERY token), ANTLR suddenly starts complaining about
ambiguities in the file.

If I replace the complex letter rule with a simpler 'a'..'z' |
'A'..'Z', everything works fine again and I can have (apparently) as
many tokens as I want.

How does this happen? Is there a limit to the number of decisions in a  
lexer?

Thanks for your help,
Martin