[antlr-interest] A very basic grammar--and I'm confused!

Sat Aug 16 21:51:30 PDT 2008

At 14:42 17/08/2008, Richard Steele wrote:
 >grammar R;
 >
 >r	:	'X' ID;
 >ID	:	'A'..'Z'+;
 >
 >However, this returns a MismatchedTokenException for every
 >(alphabetic) input.
 >It appears that the 'X' is getting greedily swallowed by the
 >expression for ID, but I don't understand why that would be,
 >nor how to prevent it from happening.
 >(If I change the 'X' to '1', or anything else not in ID, then
 >it works as I expect.)

This is why it's dangerous to use literals in parser rules :)

The main thing to keep in mind is that, unlike some other parser 
generators, ANTLR performs the entire lexing phase up front, 
without any input from the parser.  The parser only gets a shot 
once everything has been turned into tokens.

Another thing in play here is that when you use a quoted literal 
in a parser rule, what you're really telling ANTLR to do is to 
generate a hidden lexer rule that matches that literal.  Put 
another way, your grammar above is treated as equivalent to this 
(with T4 standing for the generated rule):

grammar R;

r : T4 ID;
T4 : 'X';
ID : 'A'..'Z'+;

The final piece of the puzzle is that given a choice between two 
tokens at lexing time, ANTLR will favour the longest match -- and 
once "inside" a token, it will not consider alternative 
interpretations.

So, putting this all together:
   "X" => T4["X"]
   "A" => ID["A"]
   "AX" => ID["AX"]
   "XYZ" => ID["XYZ"]
   "X YZ" => T4["X"] ID["YZ"] (plus a lexer error on the space, 
             since you don't have a whitespace rule)
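
If it helps to see the rule in action, here is a rough Python 
sketch of a longest-match tokenizer that reproduces the table 
above.  It is only a simulation of the behaviour described in this 
mail, not how ANTLR's generated lexer actually works internally; 
TOKEN_RULES and tokenize are names made up for the illustration, 
and the "skip one character on error" recovery is an assumption.

```python
import re

# Hypothetical token table mirroring the expanded grammar:
#   T4 : 'X';
#   ID : 'A'..'Z'+;
TOKEN_RULES = [
    ("T4", re.compile(r"X")),
    ("ID", re.compile(r"[A-Z]+")),
]


def tokenize(text):
    """Split text into (kind, lexeme) pairs using longest match.

    On a tie in length the rule listed first wins.  Characters that no
    rule matches are recorded as errors and skipped (an assumption made
    for this sketch).
    """
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        best = None  # (kind, end offset) of the longest match so far
        for kind, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (kind, m.end())
        if best is None:
            errors.append((pos, text[pos]))
            pos += 1
            continue
        kind, end = best
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens, errors


if __name__ == "__main__":
    for sample in ["X", "A", "AX", "XYZ", "X YZ"]:
        print(repr(sample), "=>", tokenize(sample))
```

Note that the parser never appears in this sketch: the token 
stream is fixed before any parser rule gets to look at it, which 
is why "XYZ" can never be seen as T4 followed by ID.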

Hopefully this all makes sense now :)

Where to go from here depends on exactly how your *real* grammar 
is structured (instead of the simplified example); you may need to 
merge the lexer rules and give it some explicit disambiguation -- 
or possibly just add a whitespace rule, if the 'X' is actually 
representing a keyword that must be surrounded by whitespace.
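
As a rough illustration of the whitespace option, the Python 
sketch below simulates the two phases: a longest-match lexer with 
an extra skipped WS rule, followed by a check of the token stream 
against r : T4 ID.  RULES, lex and parse_r are hypothetical names 
for this example only, and the code imitates the behaviour 
described above rather than ANTLR's real implementation.

```python
import re

# Hypothetical rules: the two from the grammar plus a whitespace rule
# whose tokens are dropped before the parser sees them.
RULES = [
    ("T4", re.compile(r"X")),
    ("ID", re.compile(r"[A-Z]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def lex(text):
    """Tokenize the whole input up front; longest match, first rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"no token matches at offset {pos}: {text[pos]!r}")
        if best_kind != "WS":  # whitespace is skipped, not passed on
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens


def parse_r(text):
    """Check the token stream against  r : T4 ID;  and return the ID text."""
    tokens = lex(text)
    kinds = [kind for kind, _ in tokens]
    if kinds != ["T4", "ID"]:
        raise ValueError(f"mismatched tokens: expected ['T4', 'ID'], got {kinds}")
    return tokens[1][1]


if __name__ == "__main__":
    print(parse_r("X YZ"))  # accepted: the space separates the keyword from the ID
    try:
        parse_r("XYZ")      # still lexed as a single ID, so the parser rejects it
    except ValueError as exc:
        print("error:", exc)
```

With the whitespace rule in place "X YZ" goes through cleanly, 
while "XYZ" is still one ID token and still fails in the parser, 
so this only helps if the keyword really is always separated from 
what follows it.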