[antlr-interest] White spaces within token definition

Fri Apr 25 04:57:21 PDT 2008

Hello,

I have stumbled upon a problem that, although it has some workarounds, has
left me puzzled about why it happens.
(I searched for a similar question but was unable to find one. I am sorry if
this has been answered somewhere else; if so, please point me to the link.)

Consider the following lexer grammar:
---------------------------------------------------
lexer grammar test;

CMD_EXIT : 'COMMAND EXIT';
ID : ('A'..'Z'|'a'..'z')+;
WhiteSpaces : (' '|'\t')+ {$channel=HIDDEN;};
---------------------------------------------------

Suppose the language being recognized has many commands with the
syntax "COMMAND <name of the command>", but I am interested only in the exit
command, so I treat "COMMAND EXIT" as a single token.
However, I would like "COMMAND <something else>" to be matched as a
sequence of two ID tokens.

With the grammar above, "COMMAND EXIT" is successfully matched as
a CMD_EXIT token. However, "COMMAND XYZ" produces the error
"line 1:8 mismatched character 'X' expecting 'E'", and what is left
(only the character Z) is matched as ID.

In the generated lexer class, in the mTokens() method, I noticed that
the lexer treats everything that starts with "COMMAND " as the
CMD_EXIT token.
When choosing the token, it simply does not consider the characters of the
token definition that come after the white space (i.e. 'E', 'X', 'I' and
'T').
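
The behaviour described above can be modelled roughly as follows. This is
a simplified Python sketch, not the code ANTLR generates: the names
LexError, predict, match_literal and tokenize are hypothetical, the input
is assumed to be a single line, and the sketch assumes the token is
chosen as soon as the input starts with "COMMAND " (as observed above).

```python
class LexError(Exception):
    """Raised when the chosen rule fails to match (hypothetical name)."""


KEYWORD = "COMMAND EXIT"
# Assumption taken from the observation above: seeing "COMMAND " is enough
# for the lexer to choose CMD_EXIT, because a space can never be part of ID.
DECIDING_PREFIX = "COMMAND "


def is_letter(ch):
    # Mirrors ID : ('A'..'Z'|'a'..'z')+
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def predict(text, pos):
    """Choose a rule from the lookahead only; no matching happens here."""
    ch = text[pos]
    if ch in " \t":
        return "WhiteSpaces"
    if text.startswith(DECIDING_PREFIX, pos):
        return "CMD_EXIT"  # chosen without looking at 'E', 'X', 'I', 'T'
    if is_letter(ch):
        return "ID"
    raise LexError(f"line 1:{pos} no viable alternative at character {ch!r}")


def match_literal(text, pos, literal):
    """Match the literal character by character; there is no fallback to ID."""
    for offset, expected in enumerate(literal):
        at = pos + offset
        actual = text[at] if at < len(text) else "<EOF>"
        if actual != expected:
            raise LexError(
                f"line 1:{at} mismatched character {actual!r} expecting {expected!r}"
            )
    return pos + len(literal)


def tokenize(text):
    """Return (rule, text) pairs; white space is dropped, like a hidden channel."""
    tokens = []
    pos = 0
    while pos < len(text):
        rule = predict(text, pos)
        if rule == "CMD_EXIT":
            end = match_literal(text, pos, KEYWORD)
        elif rule == "ID":
            end = pos
            while end < len(text) and is_letter(text[end]):
                end += 1
        else:  # WhiteSpaces
            end = pos
            while end < len(text) and text[end] in " \t":
                end += 1
        if rule != "WhiteSpaces":
            tokens.append((rule, text[pos:end]))
        pos = end
    return tokens
```

In this model, tokenize("COMMAND EXIT") yields a single CMD_EXIT token,
while tokenize("COMMAND XYZ") raises LexError with the same message as
the one quoted above, because the rule was already chosen at the space.
Unlike the real lexer, the sketch stops at the first error instead of
reporting it and continuing.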

So, if you could enlighten me as to why this is happening, I would be very
grateful!

Best Regards,
Hari