[antlr-interest] White spaces within token definition

Jim Idle jimi at temporal-wave.com
Wed May 7 08:05:27 PDT 2008


The lexer will start down the COMMAND_EXIT path, consume the space, and then find that the next sequence is not EXIT, which is why you get the error; it does not back up and retry the input as ID. You need an alternative in the COMMAND_EXIT rule that accepts just the COMMAND part, like this:

 

fragment COMMAND
    : 'COMMAND';

COMMAND_EXIT
    : COMMAND
      (
          (' EXIT')=> ' EXIT'
        | { $type = COMMAND; }
      )
    ;

ID : ('A'..'Z'|'a'..'z')+;

WS : (' '|'\t')+ {$channel=HIDDEN;};

 

This will produce COMMAND followed by ID when the next sequence is not ' EXIT'.
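
As a rough illustration of the token sequence these rules are meant to yield, here is a small hand-written Python model. It is not ANTLR output and does not use the ANTLR runtime. The function name tokenize and the (type, text) tuples are made up for the sketch, and it assumes the keyword is only recognized when the whole run of letters is exactly COMMAND, which may differ from how the generated DFA treats input such as COMMANDX.

```python
import string

LETTERS = set(string.ascii_letters)  # mirrors ID : ('A'..'Z'|'a'..'z')+


def tokenize(text):
    """Hand-written model of the revised lexer rules; not ANTLR-generated code."""
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in " \t":
            # WS goes to the hidden channel, so the sketch simply skips it.
            while i < n and text[i] in " \t":
                i += 1
        elif ch in LETTERS:
            start = i
            while i < n and text[i] in LETTERS:
                i += 1
            word = text[start:i]
            if word != "COMMAND":
                tokens.append(("ID", word))
            elif text.startswith(" EXIT", i):
                # Models the predicate (' EXIT')=> : peek first, then consume.
                i += len(" EXIT")
                tokens.append(("COMMAND_EXIT", "COMMAND EXIT"))
            else:
                # Models the empty alternative { $type = COMMAND; }.
                tokens.append(("COMMAND", word))
        else:
            raise ValueError(f"no rule matches {ch!r} at offset {i}")
    return tokens


if __name__ == "__main__":
    print(tokenize("COMMAND EXIT"))
    print(tokenize("COMMAND XYZ"))
```

In this model "COMMAND EXIT" comes out as a single COMMAND_EXIT token, while "COMMAND XYZ" comes out as COMMAND followed by ID, because the lookahead for ' EXIT' is checked before anything after COMMAND is consumed.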

 

Jim

 

 

 

From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Haralambi Haralambiev
Sent: Wednesday, May 07, 2008 12:30 AM
To: antlr-interest at antlr.org
Cc: parrt at cs.usfca.edu
Subject: Re: [antlr-interest] White spaces within token definition

 

Hello,

Is this question too basic, or is there no one who can answer it?

Could someone please give me some insight into the problem? I want to understand the cause rather than just work around the issue.

Thanks,
Hari

On 4/25/08, Haralambi Haralambiev <hharalambiev at gmail.com> wrote:

Hello,

I have stumbled upon a problem that, although it has some workarounds, puzzles me as to why it happens.
(I searched for a similar question but was unable to find one. I am sorry if this has been answered somewhere else. If so, please point me to the link.)

Consider the following lexer grammar:
---------------------------------------------------
lexer grammar test;

CMD_EXIT : 'COMMAND EXIT';
ID : ('A'..'Z'|'a'..'z')+;
WhiteSpaces : (' '|'\t')+ {$channel=HIDDEN;};
---------------------------------------------------

Suppose the language being recognized has many commands with the syntax "COMMAND <name of the command>", but I am interested only in the exit command, so I treat "COMMAND EXIT" as a single token.
However, I would like "COMMAND <something else>" to be matched as a sequence of two ID tokens.

With the grammar above, "COMMAND EXIT" is successfully matched as a CMD_EXIT token; however, "COMMAND XYZ" produces the error "line 1:8 mismatched character 'X' expecting 'E'", and what is left (only the character Z) is matched as ID.

In the mTokens() method of the generated lexer class, I noticed that the lexer treats everything that starts with "COMMAND " as the CMD_EXIT token. When predicting the token, it does not consider the characters of the token definition that come after the white space (i.e. 'E', 'X', 'I' and 'T').

So, if you could enlighten me as to why this is happening, I would be very grateful!

Best Regards,
Hari

 


