[antlr-interest] Bug in DFA matching?

Mon Feb 9 11:55:45 PST 2009

I have a grammar for a configuration file where indentation is
significant, as in Python.  It contains the following lexer rules:

WS
    : {getCharPositionInLine()!=1}? // not start-of-line whitespace
      ( ' ' | TAB )
      { $channel=HIDDEN; }
    ;
// whitespace at start of line used for INDENT processing
INITIAL_WS
    : {getCharPositionInLine()==1 && !afterIndent}? // at start of line.
      ( ' ' | TAB )*
      { this.afterIndent=true; }
    ;

Note the star in the INITIAL_WS rule, which means that *every* line
should emit an INITIAL_WS token, possibly matching nothing, before
matching anything else.

The generated DFA contains the following code:
                    case 0 :
                        int LA10_25 = input.LA(1);

                        int index10_25 = input.index();
                        input.rewind();
                        s = -1;
                        if ( ((getCharPositionInLine()!=1)) ) {s = 26;}

                        else if ( ((getCharPositionInLine()==1 && !afterIndent)) ) {s = 6;}

                        input.seek(index10_25);
                        if ( s>=0 ) return s;
                        break;

which seems to be "obviously wrong" -- getCharPositionInLine() is
going to be evaluated in the rewound state, and then we're going to
advance the input and return, which will then invoke the proper lexer
rule and re-evaluate getCharPositionInLine() in the *advanced* state,
not where the DFA evaluated it.
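
To make the concern concrete, here is a toy Java model of the effect.
It is only a sketch and does not use ANTLR at all: ToyCharStream and
PredicateTiming are made-up names, and the column tracking is my
assumption of how a lexer counts positions.  All it shows is that a
predicate which depends on the position in the line can give different
answers depending on where the stream happens to be when it runs.

```java
// Toy model only: ToyCharStream is a hypothetical stand-in, NOT ANTLR's CharStream.
final class ToyCharStream {
    private final String data;
    private int index = 0;               // absolute offset into data
    private int charPositionInLine = 0;  // column of the next character

    ToyCharStream(String data) { this.data = data; }

    int index() { return index; }
    int getCharPositionInLine() { return charPositionInLine; }

    // Consume one character, resetting the column after a newline.
    void consume() {
        char c = data.charAt(index++);
        charPositionInLine = (c == '\n') ? 0 : charPositionInLine + 1;
    }

    // Jump to an absolute offset, recomputing the column from the start.
    void seek(int target) {
        index = 0;
        charPositionInLine = 0;
        while (index < target) consume();
    }
}

class PredicateTiming {
    // The position-dependent predicate from the WS rule above.
    static boolean notLineStart(ToyCharStream in) {
        return in.getCharPositionInLine() != 1;
    }

    // Evaluate the same predicate at two different stream offsets.
    static boolean[] evaluateAt(String text, int first, int second) {
        ToyCharStream in = new ToyCharStream(text);
        in.seek(first);
        boolean a = notLineStart(in);
        in.seek(second);
        boolean b = notLineStart(in);
        return new boolean[] { a, b };
    }

    public static void main(String[] args) {
        boolean[] r = evaluateAt("a b", 1, 2);
        // prints "at offset 1: false, at offset 2: true"
        System.out.println("at offset 1: " + r[0] + ", at offset 2: " + r[1]);
    }
}
```

If the DFA and the lexer rule evaluate the predicate at different
offsets, as in the two calls above, they can disagree about which
rule applies.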

I don't quite understand the DFA well enough yet to attempt a proper
fix.  Anyone want to lend a hand?

Thanks--
 --scott

-- 
                         ( http://cscott.net/ )