[antlr-interest] Lexer Problem - ANTLR bug or my stupidity?

Mon Apr 16 11:42:44 PDT 2007

Terence, all,

could you please have a look at this simple lexer grammar? (It is a
stripped-down version of a much more reasonable grammar :-) )

+++++++++++++++++++++++++++++++++++++

grammar Foo;

fragment CHAR :
      NON_SPECIAL_CHAR
    | OVERRIDER OVERRIDER
    ;

fragment NON_SPECIAL_CHAR :
    'a'
    ;

CHAR_STRING :
    CHAR ( CHAR )*
    ;

OVERRIDER :
    '#'
    ;

++++++++++++++++++++++++++++++++++++++++

IMHO, given the input

   a##a#a

FooLexer should output three tokens:

  a##a
  #
  a

Unfortunately, it raises an error instead. If you look at the generated
lexer code, it becomes clear why:

    public final void mCHAR_STRING() throws RecognitionException {
        try {
            int _type = CHAR_STRING;
            // ReplicationTransaction.g:45:2: ( CHAR ( CHAR )* )
            // ReplicationTransaction.g:45:2: CHAR ( CHAR )*
            {
            mCHAR();
            // ReplicationTransaction.g:45:7: ( CHAR )*
            loop2:
            do {
                int alt2=2;
                int LA2_0 = input.LA(1);
                if ( (LA2_0=='#'||LA2_0=='a') ) {
                    alt2=1;
                }
                switch (alt2) {
                case 1 :
                    // ReplicationTransaction.g:45:9: CHAR
                    {
                    mCHAR();

                    }
                    break;
                default :
                    break loop2;
                }
            } while (true);
            }
            this.type = _type;
        }
        finally {
        }
    }

Once the lexer has entered mCHAR_STRING, it uses a lookahead of 1! In
other words, it does not check what follows the # before committing to
another CHAR. Since ANTLR claims to use LL(*), this looks like a real
bug to me.
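
To make the failure concrete, here is a small hand-written Java model of the grammar. It is only a sketch: it is not ANTLR's generated code or runtime, and the class name LookaheadDemo and the peekTwo switch are made up for illustration. It also assumes the choice between CHAR_STRING and OVERRIDER at the start of a token is made with two characters of lookahead. With peekTwo off, the ( CHAR )* loop decides on one character, as in mCHAR_STRING above; with peekTwo on, it also checks the character after the #.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written model of grammar Foo; NOT ANTLR's generated code or runtime.
class LookaheadDemo {
    private final String input;
    private final boolean peekTwo; // hypothetical switch: also inspect the char after '#'
    private int pos = 0;

    LookaheadDemo(String input, boolean peekTwo) {
        this.input = input;
        this.peekTwo = peekTwo;
    }

    // Returns the i-th character of lookahead (1-based), or -1 at end of input.
    private int la(int i) {
        int p = pos + i - 1;
        return p < input.length() ? input.charAt(p) : -1;
    }

    // CHAR : 'a' | '#' '#'
    private void matchChar() {
        if (la(1) == 'a') { pos++; return; }
        if (la(1) != '#') {
            throw new IllegalStateException("no viable alternative at index " + pos);
        }
        pos++; // first '#'
        if (la(1) != '#') {
            throw new IllegalStateException("expected second '#' at index " + pos);
        }
        pos++; // second '#'
    }

    // Loop decision of ( CHAR )*: one character of lookahead unless peekTwo is set.
    private boolean startsChar() {
        if (la(1) == 'a') return true;
        if (la(1) != '#') return false;
        return !peekTwo || la(2) == '#';
    }

    // Splits the whole input into CHAR_STRING and OVERRIDER tokens.
    List<String> tokens() {
        List<String> out = new ArrayList<>();
        while (pos < input.length()) {
            int start = pos;
            if (la(1) == '#' && la(2) != '#') {
                pos++;                            // OVERRIDER : '#'
            } else {
                matchChar();                      // CHAR_STRING : CHAR ( CHAR )*
                while (startsChar()) matchChar();
            }
            out.add(input.substring(start, pos));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two characters of lookahead in the loop: prints [a##a, #, a]
        System.out.println(new LookaheadDemo("a##a#a", true).tokens());
        try {
            // One character of lookahead in the loop: fails on the lone '#'
            new LookaheadDemo("a##a#a", false).tokens();
        } catch (IllegalStateException e) {
            // prints: error: expected second '#' at index 5
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

In this model the one-character variant commits to CHAR on the lone # at index 4 and then fails on the a that follows. That matches the behaviour described above, though the real lexer's error message will differ.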

In any case, does anybody have a clever workaround?

A thousand thanks in advance!

Michael