[antlr-interest] Over-eager lexer?

Mon Nov 30 15:49:35 PST 2009

The 'in' in 'boring' is enough for the lexer to decide that you are trying to write 'interesting', and when it gets to the 'g' it finds a mismatch against your INTERESTING_KEYWORD rule. Because no other rule can consume such a word, a predicate will not help here: ANTLR will see that it does not need the predicate.
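To make the failure mode concrete, here is a small Python toy model of a lexer that commits to the keyword after a short prefix. It is only a sketch, not ANTLR's generated DFA: the function name `toy_lex`, the constant `LOOKAHEAD`, and the two-character depth are assumptions chosen to mirror the 'in' described above, and whitespace and braces are simply treated as single unmatched characters.

```python
KEYWORD = "interesting"
# Assumption: a two-character prefix is enough to commit, as the 'in' in 'boring' suggests.
LOOKAHEAD = 2


def toy_lex(text):
    """Toy model of a commit-early lexer; returns (tokens, errors)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text.startswith(KEYWORD[:LOOKAHEAD], pos):
            # Prediction step: the prefix looks like the keyword, so commit to it.
            matched = 0
            while (matched < len(KEYWORD)
                   and pos + matched < len(text)
                   and text[pos + matched] == KEYWORD[matched]):
                matched += 1
            if matched == len(KEYWORD):
                tokens.append(("INTERESTING_KEYWORD", KEYWORD))
                pos += matched
            else:
                # Committed to the keyword rule, but the input diverged from it.
                bad = pos + matched
                found = text[bad] if bad < len(text) else "<EOF>"
                errors.append((bad, found, KEYWORD[matched]))
                pos = bad + 1  # crude recovery: skip the offending character
        else:
            # Fallback rule: one arbitrary character, like UNMATCHED : . ;
            tokens.append(("UNMATCHED", text[pos]))
            pos += 1
    return tokens, errors
```

In this model, `toy_lex("boring")` records a single error at index 5, where 'g' was found and 't' was expected, which lines up with the complaint quoted below, while `toy_lex("humdrum")` produces no errors because it never contains the 'in' prefix.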

If you do this, though:

grammar T;

root
    : ignored_segment (interesting_segment ignored_segment)*
    ;

ignored_segment
    : ( ~INTERESTING_KEYWORD )*
    ;

interesting_segment
    : INTERESTING_KEYWORD brace_scope
    ;

brace_scope
    : OPEN_BRACE
      ( ~( OPEN_BRACE | CLOSE_BRACE )
      | brace_scope
      )*
      CLOSE_BRACE
    ;

INTERESTING_KEYWORD : 'interesting' ;
WORDS               : ('a'..'z')+ { $type = UNMATCHED; } ;
OPEN_BRACE          : '{' ;
CLOSE_BRACE         : '}' ;
WS                  : (' '|'\t'|'\n'|'\r')+ { $channel = HIDDEN; } ;

UNMATCHED           : . ;

You will see that it does more checking before selecting the INTERESTING_KEYWORD rule.
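As a rough illustration of the effect the WORDS rule is meant to have, the following Python sketch tokenizes whole runs of lowercase letters first and only then decides whether a run is the keyword. It is a toy model under stated assumptions, not what ANTLR generates: the names `TOKEN_RE` and `toy_lex_words` are hypothetical, and a regular expression stands in for the lexer's decision logic.

```python
import re

# One alternative per lexer rule in the grammar above; WORD plays the role of WORDS.
TOKEN_RE = re.compile(
    r"(?P<WORD>[a-z]+)"
    r"|(?P<OPEN_BRACE>\{)"
    r"|(?P<CLOSE_BRACE>\})"
    r"|(?P<WS>[ \t\r\n]+)"
    r"|(?P<UNMATCHED>.)",
    re.S,
)


def toy_lex_words(text):
    """Toy model: consume a whole word, then classify it."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "WS":
            continue  # the grammar sends WS to the hidden channel; dropped here
        if kind == "WORD":
            # Only an exact, complete match counts as the keyword;
            # every other word is retyped, like {$type = UNMATCHED;}.
            kind = "INTERESTING_KEYWORD" if m.group() == "interesting" else "UNMATCHED"
        tokens.append((kind, m.group()))
    return tokens
```

With this model, 'boring' comes out as a single UNMATCHED token because the whole word is consumed before any comparison with 'interesting' takes place, so the 'in' in the middle never gets a chance to start a keyword match.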

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Michael Coupland
> Sent: Monday, November 30, 2009 3:28 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Over-eager lexer?
> 
> Hi,
> 
> I'm trying to use ANTLR to extract portions of a text file, and I'm
> having a strange lexer problem. I've boiled my problem down to a
> pretty simple case:
> 
> I want to match the interesting segments of C-like input, as
> delineated by the keyword 'interesting' and matching braces:
> 
>         /* blah blah blah - this part should be ignored */
>         interesting { /* this is the part */ { /* that is matched */ } }
>         /* also ignored */
> 
> 
> Here's my ANTLR file:
> 
>         grammar test;
> 
>         root
>         : ignored_segment (interesting_segment ignored_segment)*
>         ;
> 
>         ignored_segment
>         : ( ~ INTERESTING_KEYWORD )*
>         ;
> 
>         interesting_segment
>         : INTERESTING_KEYWORD brace_scope
>         ;
> 
>         brace_scope
>         : OPEN_BRACE (
>           ( options {greedy=true;} : ~( OPEN_BRACE | CLOSE_BRACE )
>           | brace_scope )
>         )* CLOSE_BRACE
>         ;
> 
>         WS
>         : (' '|'\t'|'\n'|'\r')+
>         {
>                 $channel=HIDDEN;
>         };
> 
>         INTERESTING_KEYWORD : 'interesting' ;
>         OPEN_BRACE : '{' ;
>         CLOSE_BRACE : '}' ;
>         UNMATCHED : . ;
> 
> When I run the grammar on the following input, I get the expected
> behavior.
> 
>         humdrum
>         interesting { xxx }
>         humdrum
> 
> 
> However, running a slightly different input through the ANTLRWorks
> debugger (or C runtime generated code) gives an error:
> 
>         boring
>         interesting { xxx }
>         boring
> 
> 
> I get the following lexer complaint in the debugger output:
> 
>         line 1:5 mismatched character 'g' expecting 't'
> 
> 
> It's like the lexer sees the 'in' in 'boring' and then refuses to give
> up trying to match an 'interesting' token. Can someone explain why
> this is happening, and how to solve it? I realize I'm kinda abusing
> the lexer/parser, but the grammar seems like the best way to
> accomplish my goal.
> 
> Thanks!
> 	Michael
> 
> PS. I'm vaguely aware of the concept of filter lexers, but I don't
> think I can do the brace matching I need with them? Also, I can't
> meaningfully test them in ANTLRWorks, since it doesn't show the lexer
> results. I rely on ANTLRWorks heavily to author my grammars before
> running them in the ANTLR C Runtime.
> 