[antlr-interest] White spaces within token definition

Gavin Lambert antlr at mirality.co.nz
Wed May 7 01:39:48 PDT 2008


At 19:30 7/05/2008, Haralambi Haralambiev wrote:
>CMD_EXIT:'COMMAND EXIT';
>ID:('A'..'Z'|'a'..'z')+;
>WhiteSpaces:(' '|'\t')+ {$channel=HIDDEN;};
>---------------------------------------------------
>
>Consider that the language that is recognized has many commands 
>with the syntax "COMMAND <name of the command>", but I am 
>interested only in the exit command, so I consider "COMMAND EXIT" 
>as a token.
>However, I would like "COMMAND <something else>" to be matched as 
>the sequence of two ID tokens.
>
>With the grammar above, the "COMMAND EXIT" is successfully 
>matched as a CMD_EXIT token, however "COMMAND XYZ" produces an 
>error "line 1:8 mismatched character 'X' expecting 'E'" and what 
>is left (only the character Z) is matched as ID.
>
>In the generated lexer class, in the mTokens() method, I noticed 
>that the lexer will consider everything that starts with "COMMAND 
>" as the CMD_EXIT token. It just doesn't consider the characters 
>in the token definition that come after the white space (i.e. 
>'E', 'X', 'I' and 'T') during the recognition.
>
>So, if you could enlighten me on why this is happening, I will be 
>very grateful!

This happens because, up to the whitespace, both CMD_EXIT and ID 
are viable candidates.  Once the lexer sees the space, ID is ruled 
out and only CMD_EXIT remains, so that rule "wins" and the lexer 
commits to it without looking any further ahead into the rest of 
the rule.  (ID followed by WhiteSpaces followed by ID is not 
considered, since a single token is always preferred over multiple 
ones.)
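
To make the "commit as soon as one candidate is left" behaviour 
concrete, here is a small toy model in Python.  It is not ANTLR 
code and not what the generated lexer literally does; predict() 
and match() are hypothetical names, and only the two rules from 
the question are modelled.

```python
KEYWORD = "COMMAND EXIT"  # the literal behind the CMD_EXIT rule


def is_letter(ch):
    """True for a single character in 'A'..'Z' or 'a'..'z'."""
    return len(ch) == 1 and ch.isascii() and ch.isalpha()


def predict(text, pos=0):
    """Pick a rule by scanning only until one candidate is left."""
    for i in range(len(KEYWORD)):
        ch = text[pos + i] if pos + i < len(text) else ""
        keyword_viable = ch == KEYWORD[i]
        id_viable = is_letter(ch)
        if keyword_viable and id_viable:
            continue  # both rules still possible, keep scanning
        if keyword_viable:
            return "CMD_EXIT"  # ID ruled out: commit without reading on
        # The keyword is ruled out; ID is fine if it has at least one letter.
        return "ID" if (id_viable or i > 0) else None
    return "CMD_EXIT"  # not reached here: the space already decides


def match(rule, text, pos=0):
    """Match the rule chosen by predict(); there is no going back."""
    if rule == "CMD_EXIT":
        for i, expected in enumerate(KEYWORD):
            actual = text[pos + i] if pos + i < len(text) else "<EOF>"
            if actual != expected:
                raise ValueError(
                    f"{pos + i}: mismatched character {actual!r} "
                    f"expecting {expected!r}"
                )
        return KEYWORD
    end = pos
    while end < len(text) and is_letter(text[end]):
        end += 1
    return text[pos:end]
```

In this model predict("COMMAND XYZ") returns "CMD_EXIT" after 
seeing the space, and the subsequent match() fails at offset 8 on 
'X' where 'E' was expected, which mirrors the error in the 
question.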

(I think it ought to work as is, and there's a bug filed to that 
effect.  Apparently it has something to do with not handling 
follow sets and being a bit too generous with error recovery.)

The general workaround for this problem is to merge the rules, 
using either syntactic or semantic predicates to disambiguate and 
change the final token type:

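// A fragment rule never produces a token by itself; ID calls it.
fragment CMD_EXIT: 'COMMAND EXIT';
ID
   :  (CMD_EXIT) => CMD_EXIT { $type = CMD_EXIT; }  // whole literal seen
   |  ('A'..'Z'|'a'..'z')+                          // ordinary identifier
   ;
WS: (' '|'\t')+ {$channel=HIDDEN;};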

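The syntactic predicate (synpred) here forces the lexer to scan 
the entire 'COMMAND EXIT' character sequence before it takes the 
CMD_EXIT branch; if the sequence is not all there, it falls 
through to the ordinary ID alternative.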
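
The effect of the predicate can be sketched with another small toy 
model in Python.  Again this is not ANTLR code: next_token() and 
tokenize() are hypothetical helpers, and the speculative match is 
reduced to a plain string comparison.

```python
KEYWORD = "COMMAND EXIT"  # the literal behind the CMD_EXIT fragment


def is_letter(ch):
    """True for a single character in 'A'..'Z' or 'a'..'z'."""
    return ch.isascii() and ch.isalpha()


def next_token(text, pos=0):
    """Return (type, lexeme) for the token at pos, or None."""
    # Like (CMD_EXIT)=> : check the WHOLE literal before choosing it.
    if text.startswith(KEYWORD, pos):
        return ("CMD_EXIT", KEYWORD)
    # Otherwise fall through to the ordinary identifier alternative.
    end = pos
    while end < len(text) and is_letter(text[end]):
        end += 1
    if end == pos:
        return None
    return ("ID", text[pos:end])


def tokenize(text):
    """Split text into tokens, skipping spaces and tabs (hidden channel)."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t":
            pos += 1
            continue
        token = next_token(text, pos)
        if token is None:
            raise ValueError(f"{pos}: unexpected character {text[pos]!r}")
        tokens.append(token)
        pos += len(token[1])
    return tokens
```

With this model, tokenize("COMMAND XYZ") gives two ID tokens, 
while tokenize("COMMAND EXIT") gives a single CMD_EXIT token.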


