[antlr-interest] Lexer code not generated as expected?

Tue Dec 15 07:10:33 PST 2009

Hello,

I have run into a strange problem using ANTLR, and I wonder whether it is a bug.
Here is part of my grammar:

WS  
    : ' ' {$channel=HIDDEN;} 
    ;

CUTLINE
    : ('\n' ' '* '+') {$channel=HIDDEN;} 
    ;

NEWLINE
    : '\n' 
    ;

And here is what ANTLR generates in the function mTokens:

static void 
mTokens(pAntlrTestbenchLexer ctx)
{
    {
        //  antlr/AntlrTestbench.g:1:8: ( T__10 | WS | CUTLINE | NEWLINE | ID | INT )

        ANTLR3_UINT32 alt4;

        alt4=6;

        switch ( LA(1) ) 
        {
...
        case '\n':
            {
                switch ( LA(2) )
                {
                case ' ':
                case '+':
                    {
                        alt4=3; // CUTLINE
                    }
                    break;

                default:
                    alt4=4; // NEWLINE
                }
            }
            break;

...

This is not what I want: when the lexer input is "\n ", I would expect it to recognize the tokens NEWLINE and WS, but with the code above it commits to the CUTLINE rule and fails.
After matching '\n', the lexer should look ahead to the first character that is not ' '. If that character is '+', the correct alternative is the CUTLINE rule; otherwise, and only then, the correct alternative is the NEWLINE rule.
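
To make the difference concrete, here is a small standalone C sketch of the two decisions. It is not ANTLR output: the function names and the enum are made up for illustration, and the only things taken from the generated code are the alternative numbers (3 for CUTLINE, 4 for NEWLINE). Both functions assume the first character of the buffer is '\n'.

```c
#include <stddef.h>

/* Alternative numbers as they appear in the generated mTokens. */
enum { ALT_CUTLINE = 3, ALT_NEWLINE = 4 };

/* Decision as generated: only the single character after '\n' is examined.
 * p must point at a NUL-terminated buffer whose first character is '\n'. */
static int predict_fixed_la2(const char *p)
{
    if (p[1] == ' ' || p[1] == '+')
        return ALT_CUTLINE;
    return ALT_NEWLINE;
}

/* Decision I expected: skip every ' ' after '\n', then test for '+'.
 * The loop stops at the terminating NUL at the latest. */
static int predict_scan_spaces(const char *p)
{
    size_t i = 1;
    while (p[i] == ' ')
        i++;
    return (p[i] == '+') ? ALT_CUTLINE : ALT_NEWLINE;
}
```

For the input "\n ", predict_fixed_la2 picks alternative 3 (CUTLINE), which then fails because no '+' follows, whereas predict_scan_spaces picks alternative 4 (NEWLINE). For "\n  +" both pick CUTLINE.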

The workaround I have found is to change the grammar this way:

NEWLINE
    : '\n' ' '*
    ;

With this change it works as I want, but it seems strange to have to resolve the ambiguity this way.
So is the C code generated by ANTLR correct, or is this a bug?

Thanks,
Yann
