[antlr-interest] When can .* be used? (was: Matching Last Line in ANTLR)

David-Sarah Hopwood david-sarah at jacaranda.org
Tue Aug 18 15:42:10 PDT 2009


[Note that the version I posted using NOTNEWLINE* solves the original
poster's problem (as would appending '\n' to the stream before lexing,
although that requires writing extra code). The discussion below is
just about details of why other approaches using .* don't work.]
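
As a rough sketch of the second option (appending '\n' before lexing),
assuming the Java target: the class and method names below are
hypothetical and not part of ANTLR. The helper only guarantees that the
text ends with a newline, so that rules ending in NEWLINE can also match
the last line.

```java
// Hypothetical helper; the names are illustrative, not part of ANTLR.
final class InputPadding {
    private InputPadding() {}

    /** Returns text unchanged if it already ends with '\n', else appends one. */
    static String ensureTrailingNewline(String text) {
        if (text.endsWith("\n")) {
            return text;
        }
        return text + "\n";
    }
}
```

With the ANTLR 3 Java runtime, the padded string would then presumably
be wrapped in an ANTLRStringStream and handed to the generated lexer.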

Gavin Lambert wrote:
> At 09:05 19/08/2009, consiliens at gmail.com wrote:
>  >For testing I removed the .* and, while there are no errors, it
>  >still doesn't match b. as the token MC_INCORRECT unless there
>  >is a newline after it.
> [...]
>  >MC_QUESTION  : INT ('.'|')') ENDOFLINE;
>  >MC_INCORRECT : LETTER '.' ENDOFLINE;
>  >MC_CORRECT   : '*' MC_INCORRECT;
>  >
>  >fragment ENDOFLINE : NEWLINE | { input.LA(1) == EOF }?;
> 
> Are you using the debugger or the interpreter to test with?  The 
> interpreter doesn't execute predicates, so it won't work properly; 
> you need to use the debugger.

Right.

> It also might pay to try a few variations on the ENDOFLINE rule; 
> sometimes ANTLR seems to ignore predicates if it thinks that 
> they're not accomplishing anything.  Try this, for example:
> 
> fragment ENDOFLINE : { input.LA(1) == EOF }? => | NEWLINE ;
> 
> or this:
> 
> fragment ENDOFLINE : NEWLINE | EOF ;

ENDOFLINE can indeed be simplified to NEWLINE | EOF.

However, that won't help because it is not the predicate that
causes the problem here; it's the fact that the match immediately
following .* uses the '|' operator. Note that it doesn't matter
whether this match is "inlined" or in a separate fragment rule
(and it also doesn't matter whether (options { greedy=false; } : .)*
is used instead of .*).

For instance, this version of MC_QUESTION still produces the warning:

MC_QUESTION :
  INT ('.'|')') (options { greedy=false; } : .)* ('\r'? '\n' | EOF);

Either of these works and does not warn (but neither accepts end-of-file):

MC_QUESTION : INT ('.'|')') .* NEWLINE;
or
MC_QUESTION : INT ('.'|')') .* ('\r'? '\n');

but this produces the warning:

MC_QUESTION : INT ('.'|')') .* ('\r' '\n' | '\n');

even though you would normally expect ('\r'? '\n') to be equivalent to
('\r' '\n' | '\n').

Therefore, .* can't be used in cases where the match following it
necessarily involves an alternation that can't be expressed using '?'.
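
To illustrate why the NOTNEWLINE* approach mentioned at the top copes
with end-of-file, here is a hand-written Java analogue. It is not code
generated by ANTLR, and it assumes NOTNEWLINE means "any character other
than '\r' or '\n'"; the class and method names are hypothetical. The loop
stops at a newline or at the end of the input, so nothing has to follow
the match and no alternation with EOF is needed.

```java
// Hand-written analogue of NOTNEWLINE*; illustrative only, not ANTLR output.
final class RestOfLine {
    private RestOfLine() {}

    /**
     * Returns the index just past the run of non-newline characters that
     * starts at 'start'. Stops at '\r', '\n', or the end of the input.
     */
    static int skipNotNewline(CharSequence input, int start) {
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\r' || c == '\n') {
                break; // the line terminator is left for a separate token
            }
            i++;
        }
        return i;
    }
}
```

The newline itself is left unconsumed, which corresponds to matching
NEWLINE as its own token rather than as part of the question or answer
token.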

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com


