[antlr-interest] Lexer matching non-matching rule

Fri May 15 05:03:16 PDT 2009

Sorry if this is a stupid question, but if it is I hope it has a quick
stupid answer.

My ANTLR-generated lexer complains about an unexpected character in a
situation where it could have matched the input with other rules. My
understanding of ANTLR and other lexer generators is that the lexer
should use the rule that matches the longest prefix of the input, but
here it seems to commit to a rule that does NOT match the input. Why
does it do that?

Here is a pretty much minimal example that produces the problem:

=======================================================================
grammar Y;
options { output=AST; }

file
    : IDENT DOT EOF
    ;

IDENT:          ('a'..'z' | 'A'..'Z')+;
DOT:            '.';
WHITESPACE:     ('\f' | '\n' | '\r' | '\t' | ' ')+
                { $channel = HIDDEN; };

URL:            ('a'..'z') ('a'..'z' | '0'..'9' | '+' | '-' | '.')* ':'
                ~('\f' | '\n' | '\r' | '\t' | ' ')*;
=======================================================================

Here is a sample input:

=======================================================================
foo.
=======================================================================

I get the following complaint when using the lexer + parser:

line 1:4 mismatched character '\n' expecting ':'

But if I remove the URL lexer rule (which is not used by the parser),
the sample goes through just fine.

Can somebody explain this? Shouldn't the lexer just back out of the
URL rule when the input does not match it, and fall back to IDENT and
DOT?
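
To make the expectation concrete, here is a minimal Python sketch of
longest-match tokenization, with the four lexer rules transcribed by
hand as regular expressions. It only illustrates the behaviour I
expected, not how ANTLR actually picks a rule; the names RULES and
tokenize are made up for this sketch.

```python
import re

# The four lexer rules from the grammar, transcribed as regular
# expressions (hand translation, an assumption of this sketch).
RULES = [
    ("IDENT", re.compile(r"[a-zA-Z]+")),
    ("DOT", re.compile(r"\.")),
    ("WHITESPACE", re.compile(r"[\f\n\r\t ]+")),
    ("URL", re.compile(r"[a-z][a-z0-9+\-.]*:[^\f\n\r\t ]*")),
]


def tokenize(text):
    """Longest-match tokenizer: at each position, try every rule and
    keep the one that consumes the most input. A rule that cannot
    match completely is simply not a candidate."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' means the earlier rule wins on a tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at offset {pos}"
            )
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    # The sample input from the post, including the trailing newline.
    for name, lexeme in tokenize("foo.\n"):
        print(name, repr(lexeme))
```

Under this scheme the sample input comes out as IDENT, DOT,
WHITESPACE, because URL never matches completely at any position,
while an input such as http://x.org would come out as a single URL
token.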

J'