[antlr-interest] Over-eager lexer?

Mon Nov 30 15:27:40 PST 2009

Hi,

I'm trying to use ANTLR to extract portions of a text file, and I'm
having a strange lexer problem. I've boiled my problem down to a
pretty simple case:

I want to match the interesting segments of C-like input. Each one is
introduced by the keyword 'interesting' and followed by a balanced,
possibly nested, pair of braces:

        /* blah blah blah - this part should be ignored */
        interesting { /* this is the part */ { /* that is matched */ } }
        /* also ignored */
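
To make the target concrete, here is a rough Python sketch of the extraction I'm after, written outside ANTLR. The function name `find_interesting_segments` is made up for illustration, and skipping a keyword that has no brace scope after it is my assumption rather than something the grammar below does.

```python
def find_interesting_segments(text, keyword="interesting"):
    """Return each 'keyword { ... }' span whose braces are balanced."""
    segments = []
    pos = 0
    while True:
        start = text.find(keyword, pos)
        if start == -1:
            return segments
        i = start + len(keyword)
        # Skip whitespace between the keyword and the opening brace.
        while i < len(text) and text[i].isspace():
            i += 1
        if i == len(text) or text[i] != "{":
            # Assumption: a keyword with no brace scope is simply ignored.
            pos = start + len(keyword)
            continue
        depth = 0
        end = None
        for j in range(i, len(text)):
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
                if depth == 0:
                    end = j + 1  # one past the brace that closes the scope
                    break
        if end is None:
            return segments  # unbalanced braces: stop rather than guess
        segments.append(text[start:end])
        pos = end


sample = (
    "/* blah blah blah - this part should be ignored */\n"
    "interesting { /* this is the part */ { /* that is matched */ } }\n"
    "/* also ignored */\n"
)
print(find_interesting_segments(sample))
```

Everything outside the returned spans corresponds to what I call an ignored segment below.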

Here's my ANTLR file:

        grammar test;

        root
        : ignored_segment (interesting_segment ignored_segment)*
        ;

        ignored_segment
        : ( ~ INTERESTING_KEYWORD )*
        ;

        interesting_segment
        : INTERESTING_KEYWORD brace_scope
        ;

        brace_scope
        : OPEN_BRACE (
          ( options {greedy=true;} : ~( OPEN_BRACE | CLOSE_BRACE )
          | brace_scope )
        )* CLOSE_BRACE
        ;

        WS
        : (' '|'\t'|'\n'|'\r')+
        {
                $channel=HIDDEN;
        };

        INTERESTING_KEYWORD : 'interesting' ;
        OPEN_BRACE : '{' ;
        CLOSE_BRACE : '}' ;
        UNMATCHED : . ;

When I run the grammar on the following input, I get the expected behavior:

        humdrum
        interesting { xxx }
        humdrum

However, running a slightly different input through the ANTLRWorks
debugger (or through code generated for the C runtime) gives an error:

        boring
        interesting { xxx }
        boring

I get the following lexer complaint in the debugger output:

        line 1:5 mismatched character 'g' expecting 't'

It's like the lexer sees the 'in' in 'boring' and then refuses to give
up trying to match an 'interesting' token. Can someone explain why
this is happening, and how to solve it? I realize I'm kinda abusing
the lexer/parser, but the grammar seems like the best way to
accomplish my goal.
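
To illustrate my guess, here is a toy Python model of a lexer that commits to the keyword once the first two characters match and never falls back. This is only a sketch of the hypothesis, not ANTLR's actual prediction algorithm; the function `toy_lex` and its `lookahead` parameter are invented for illustration.

```python
def toy_lex(text, keyword="interesting", lookahead=2):
    """Toy lexer: picks the keyword rule from a short prefix, then commits."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in " \t\n\r":
            i += 1  # whitespace is hidden in the grammar; dropped here
            continue
        if ch == "{":
            tokens.append(("OPEN_BRACE", ch))
            i += 1
            continue
        if ch == "}":
            tokens.append(("CLOSE_BRACE", ch))
            i += 1
            continue
        if text[i:i + lookahead] == keyword[:lookahead]:
            # Committed to the keyword: every character must now match,
            # and there is no fallback to UNMATCHED on failure.
            for offset, expected in enumerate(keyword):
                actual = text[i + offset:i + offset + 1]
                if actual != expected:
                    raise ValueError(
                        f"offset {i + offset}: mismatched character "
                        f"{actual!r} expecting {expected!r}"
                    )
            tokens.append(("INTERESTING_KEYWORD", keyword))
            i += len(keyword)
            continue
        tokens.append(("UNMATCHED", ch))
        i += 1
    return tokens


print(toy_lex("humdrum interesting { xxx }"))
try:
    toy_lex("boring")
except ValueError as err:
    print(err)
```

This toy model lexes 'humdrum' without trouble, because it contains no 'in', but fails on 'boring' at the 'g', which is consistent with the error I'm seeing.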

Thanks!
	Michael

PS. I'm vaguely aware of the concept of filter lexers, but I don't
think I can do the brace matching I need with them. Also, I can't
meaningfully test them in ANTLRWorks, since it doesn't show the lexer
results. I rely heavily on ANTLRWorks to author my grammars before
running them with the ANTLR C runtime.