[antlr-interest] Bug in DFA matching?

Mon Feb 9 16:04:30 PST 2009

On Mon, Feb 9, 2009 at 6:21 PM, C. Scott Ananian <cscott at cscott.net> wrote:
> On Mon, Feb 9, 2009 at 3:17 PM, Jim Idle <jimi at temporal-wave.com> wrote:
>> C. Scott Ananian wrote:
>>
>> I have a grammar for a configuration file where indentation is
>> significant, as in Python.  It contains the following lexer rules:
>>
>> WS
>>   : {getCharPositionInLine()!=1}? // not start-of-line whitespace
>>   ( ' ' | TAB )
>>     { $channel=HIDDEN; }
>>     ;
>> // whitespace at start of line used for INDENT processing
>> INITIAL_WS
>>       : {getCharPositionInLine()==1 && !afterIndent}? // at start of line.
>>       ( ' ' | TAB )*
>>     { this.afterIndent=true; }
>>     ;
>>
>> First try a gated predicate rather than a straight semantic predicate, and fix
>> your INITIAL_WS so that it does not match a completely empty sequence (use +,
>> not *):
>>
>> {getCharPositionInLine()!=1}?=>
>
> The gated predicate did the trick, thanks!  (The * to + change wasn't
> necessary; ANTLR's fine with matching an empty sequence, as long as
> you use a gated predicate I guess.)

(In case anyone is following along at home:) ANTLR doesn't have a
problem with matching an empty sequence, but the lexer does try to use
the longest match available, not the first matching rule.  This
doesn't seem to be clearly documented anywhere, and it's a change from
ANTLRv2 (or a bug?).  The gated predicate works fine, but you also need
to gate every other rule in your lexer with {afterIndent}?=> (the
complement of the !afterIndent test in INITIAL_WS).  Otherwise the
zero-length INITIAL_WS match loses to any other rule that can match
even one character, so INITIAL_WS is never chosen when the
indentation is empty.
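
A minimal sketch of what that gating might look like, under some
assumptions: NEWLINE and ID are hypothetical rules that do not appear
in the grammar quoted above, afterIndent is assumed to be a boolean
declared in @lexer::members, and NEWLINE is assumed to be the place
where the flag gets reset.  The guard on the other rules is written as
{afterIndent}?=>, so they only become viable once the indent token for
the current line has been produced.  Untested; adjust to your grammar.

    @lexer::members {
        // assumption: false at the start of every line
        boolean afterIndent = false;
    }

    // whitespace at start of line, may be zero-length
    INITIAL_WS
        : {getCharPositionInLine()==1 && !afterIndent}?=>
          ( ' ' | TAB )*
          { this.afterIndent=true; }
        ;

    // hypothetical rule: not viable until INITIAL_WS has matched
    ID
        : {afterIndent}?=>
          ( 'a'..'z' | 'A'..'Z' | '_' )+
        ;

    // hypothetical rule: ends the line and re-arms INITIAL_WS
    NEWLINE
        : {afterIndent}?=>
          '\r'? '\n'
          { this.afterIndent=false; }
        ;

With the guards in place, INITIAL_WS should be the only viable
alternative at the start of a line, so its empty match no longer has
to compete with a longer match from ID.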
 --scott

-- 
                         ( http://cscott.net/ )