[antlr-interest] White spaces within token definition
Gavin Lambert
antlr at mirality.co.nz
Wed May 7 01:39:48 PDT 2008
At 19:30 7/05/2008, Haralambi Haralambiev wrote:
>CMD_EXIT:'COMMAND EXIT';
>ID:('A'..'Z'|'a'..'z')+;
>WhiteSpaces:(' '|'\t')+ {$channel=HIDDEN;};
>---------------------------------------------------
>
>Consider that the language that is recognized has many commands
>with the syntax "COMMAND <name of the command>", but I am
>interested only in the exit command, so I consider "COMMAND EXIT"
>as a token.
>However, I would like "COMMAND <something else>" to be matched as
>the sequence of two ID tokens.
>
>With the grammar above, the "COMMAND EXIT" is successfully
>matched as a CMD_EXIT token, however "COMMAND XYZ" produces an
>error "line 1:8 mismatched character 'X' expecting 'E'" and what
>is left (only the character Z) is matched as ID.
>
>In the generated lexer class, in the mTokens() method I noticed
>that the lexer will consider everything that starts with "COMMAND
>" as the CMD_EXIT token. It just doesn't consider the characters
>in the token definition, that were after the white space (i.e.
>'E', 'X', 'I' and 'T') during the recognition.
>
>So, if you could enlighten me on why this is happening, I will be
>very grateful!
The reason is that, up until the lexer hits the whitespace, both
CMD_EXIT and ID are viable targets. Once it sees the whitespace,
it knocks ID out of the running and leaves only CMD_EXIT, so that
one "wins". It never looks further ahead to the rest of the
rule. (ID followed by WhiteSpaces followed by ID is not
considered, since a single token is always preferred over multiple
ones.)
(I think it ought to work as is, and there's a bug filed to that
effect. Apparently it has something to do with not handling
follow sets and being a bit too generous with error recovery.)
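To make the failure mode concrete, here is a rough Python sketch of that
decision process. It is only a toy model under my own assumptions, not
the code ANTLR generates: the names predict, match_literal and
first_token are made up for illustration, and error recovery is left
out entirely.

```python
LITERAL = "COMMAND EXIT"


def predict(text, pos):
    """Pick a rule from the shortest prefix that rules out the other one."""
    for offset, expected in enumerate(LITERAL):
        i = pos + offset
        if i >= len(text) or text[i] != expected:
            return "ID"        # diverged before the space: only ID is left
        if expected == " ":
            return "CMD_EXIT"  # ID cannot contain a space, so stop looking
    return "CMD_EXIT"


def match_literal(text, pos):
    """Match the whole literal after prediction has already committed to it."""
    for offset, expected in enumerate(LITERAL):
        i = pos + offset
        actual = text[i] if i < len(text) else "<EOF>"
        if actual != expected:
            raise ValueError(
                f"line 1:{i} mismatched character {actual!r} expecting {expected!r}"
            )
    return pos + len(LITERAL)


def first_token(text):
    """Return (type, text) of the first token, or raise on a mismatch."""
    if predict(text, 0) == "CMD_EXIT":
        return "CMD_EXIT", text[:match_literal(text, 0)]
    end = 0
    while end < len(text) and text[end].isascii() and text[end].isalpha():
        end += 1
    return "ID", text[:end]
```

In this sketch first_token("COMMAND EXIT") and first_token("COMMANDER")
both come out fine, but first_token("COMMAND XYZ") has already committed
to CMD_EXIT by the time it reaches the 'X', so it fails with the same
"mismatched character 'X' expecting 'E'" complaint as in the question.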
The general workaround for this problem is to merge the rules,
using either syntactic or semantic predicates to disambiguate and
change the final token type:
  fragment CMD_EXIT: 'COMMAND EXIT';
  ID
    : (CMD_EXIT) => CMD_EXIT { $type = CMD_EXIT; }
    | ('A'..'Z'|'a'..'z')+
    ;
  WS: (' '|'\t')+ {$channel=HIDDEN;};
The syntactic predicate (synpred) here forces the lexer to look through
the entire character sequence before taking the CMD_EXIT branch.
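For comparison, the effect of the merged rule can be sketched in Python
as well. Again this is just a toy model with made-up names (is_letter,
tokenize), not what ANTLR emits: the whole literal is checked
speculatively before the token type is chosen, and anything else that
starts with a letter falls back to ID.

```python
LITERAL = "COMMAND EXIT"


def is_letter(ch):
    """Same character set as ('A'..'Z'|'a'..'z') in the grammar."""
    return ("A" <= ch <= "Z") or ("a" <= ch <= "z")


def tokenize(text):
    """Return the visible tokens as (type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch in " \t":
            # WS goes to the hidden channel; this sketch simply drops it.
            while pos < len(text) and text[pos] in " \t":
                pos += 1
        elif text.startswith(LITERAL, pos):
            # The speculative match of the whole literal succeeded,
            # so the token type becomes CMD_EXIT.
            tokens.append(("CMD_EXIT", LITERAL))
            pos += len(LITERAL)
        elif is_letter(ch):
            # The literal did not match in full: fall back to plain ID.
            start = pos
            while pos < len(text) and is_letter(text[pos]):
                pos += 1
            tokens.append(("ID", text[start:pos]))
        else:
            raise ValueError(f"line 1:{pos} no viable alternative at character {ch!r}")
    return tokens
```

With this model tokenize("COMMAND EXIT") gives a single CMD_EXIT token,
while tokenize("COMMAND XYZ") gives the two ID tokens COMMAND and XYZ.
Note that the sketch also turns "COMMAND EXITS" into CMD_EXIT followed
by an ID for the trailing S; whether that is acceptable depends on the
language being recognized.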