[antlr-interest] White spaces within token definition

Wed May 7 10:27:08 PDT 2008

When I try it with the following grammar, "COMMAND EXIT" is still not
recognised.

With the input "one two COMMAND EXIT four", the result is:
ID one
ID two
ID COMMAND
ID EXIT
ID four

Is there something missing from the code?

Simos

=======
grammar SpaceIssue;

options { language = Python; }

statement       : atoken* EOF ;

atoken  :       
        COMMAND_EXIT    { print "COMMAND", $COMMAND_EXIT.text   }
        | ID            { print "ID", $ID.text          } ;

fragment COMMAND : 'COMMAND' ;

COMMAND_EXIT    : COMMAND ( ('EXIT') => 'COMMAND'| { $type = COMMAND; } ) ;

ID  : ('A'..'Z'|'a'..'z')+ ;

WS: (' '|'\t')+ { $channel=HIDDEN; } ;
=======
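
One thing that may be worth checking in the grammar above: the predicate looks for 'EXIT' directly after COMMAND, while the input has a space at that position, and the guarded alternative then matches 'COMMAND' rather than the ' EXIT' literal used in the quoted reply below. A minimal Python sketch of that lookahead check follows. It is plain string handling, not ANTLR output, and `predicate_holds` is a hypothetical helper name.

```python
def predicate_holds(text, consumed, lookahead):
    """Return True if `lookahead` appears immediately after `consumed`.

    Hypothetical helper: it only mimics the idea of a lookahead check
    made once a prefix has been matched; it is not ANTLR's algorithm.
    """
    if not text.startswith(consumed):
        return False
    return text.startswith(lookahead, len(consumed))


source = "COMMAND EXIT four"

# Lookahead without the space, as in the grammar above.
print(predicate_holds(source, "COMMAND", "EXIT"))

# Lookahead with the leading space, as in the quoted reply.
print(predicate_holds(source, "COMMAND", " EXIT"))
```

The first check fails because the character after "COMMAND" is a space, so under this reading the guarded alternative would never be chosen for this input.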

Jim Idle wrote:
>
> The lexer will start down the COMMAND_EXIT path, consume the space, 
> then find that the next sequence is not EXIT so you get the error. You 
> need an alt in the COMMAND sequence to see just the COMMAND part, like 
> this:
>
> fragment COMMAND
>     : 'COMMAND';
>
> COMMAND_EXIT
>     : COMMAND
>     (
>          (' EXIT')=> ' EXIT'
>        | { $type = COMMAND; }
>     )
>     ;
>
> ID : ('A'..'Z'|'a'..'z')+;
>
> WS: (' '|'\t')+ {$channel=HIDDEN;};
>
> This will produce COMMAND ID when the next sequence is not ' EXIT'.
>
> Jim
>
> *From:* antlr-interest-bounces at antlr.org 
> [mailto:antlr-interest-bounces at antlr.org] *On Behalf Of *Haralambi 
> Haralambiev
> *Sent:* Wednesday, May 07, 2008 12:30 AM
> *To:* antlr-interest at antlr.org
> *Cc:* parrt at cs.usfca.edu
> *Subject:* Re: [antlr-interest] White spaces within token definition
>
>  
>
> Hello,
>
> Is this question too newbie, or is there no one who could answer it?
>
> Could someone please give me some insight into the problem? I want to
> understand the cause, not just work around the issue.
>
> Thanks,
> Hari
>
> On 4/25/08, Haralambi Haralambiev <hharalambiev at gmail.com> wrote:
>
> Hello,
>
> I have stumbled upon a problem that has some workarounds, but I am
> puzzled about why it happens.
> (I searched for a similar question but was unable to find one. I am
> sorry if this has been answered elsewhere; if so, please point me to
> the link.)
>
> Consider the following lexer grammar:
> ---------------------------------------------------
> lexer grammar test;
>
> CMD_EXIT : 'COMMAND EXIT';
> ID : ('A'..'Z'|'a'..'z')+;
> WhiteSpaces : (' '|'\t')+ {$channel=HIDDEN;};
> ---------------------------------------------------
>
> Suppose the language being recognized has many commands with the
> syntax "COMMAND <name of the command>", but I am interested only in
> the exit command, so I treat "COMMAND EXIT" as a single token.
> However, I would like "COMMAND <something else>" to be matched as a
> sequence of two ID tokens.
>
> With the grammar above, "COMMAND EXIT" is successfully matched as a
> CMD_EXIT token; however, "COMMAND XYZ" produces the error "line 1:8
> mismatched character 'X' expecting 'E'", and what is left (only the
> character Z) is matched as ID.
>
> In the generated lexer class, in the mTokens() method, I noticed that
> the lexer treats everything that starts with "COMMAND " as the
> CMD_EXIT token. When making that decision, it simply does not consider
> the characters of the token definition that come after the white space
> (i.e. 'E', 'X', 'I' and 'T').
>
> So, if you could enlighten me on why this is happening, I would be
> very grateful!
>
> Best Regards,
> Hari
>
>  
>
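
The behaviour described in the original question, and the effect of checking the whole " EXIT" suffix before committing, can be mimicked with a toy tokenizer. This is a minimal Python sketch, not ANTLR's generated code or its actual prediction algorithm. The names `tokenize` and `commit_early` are hypothetical, error recovery is not modelled (a mismatch simply raises), and the non-committing mode emits ID for a lone COMMAND, which is the result the original poster asked for, rather than the COMMAND type set in the quoted grammar.

```python
import re

ID_RE = re.compile(r"[A-Za-z]+")
WS_RE = re.compile(r"[ \t]+")
CMD_EXIT_TEXT = "COMMAND EXIT"


def tokenize(text, commit_early):
    """Toy tokenizer (hypothetical, not ANTLR's algorithm).

    commit_early=True  models the reported behaviour: seeing 'COMMAND '
                       is enough to commit to CMD_EXIT.
    commit_early=False checks the whole literal before committing.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        prefix = "COMMAND " if commit_early else CMD_EXIT_TEXT
        if text.startswith(prefix, pos):
            # Committed: every character of the literal must now match.
            for offset, expected in enumerate(CMD_EXIT_TEXT):
                at = pos + offset
                actual = text[at] if at < len(text) else "<EOF>"
                if actual != expected:
                    raise ValueError(
                        "line 1:%d mismatched character %r expecting %r"
                        % (at, actual, expected)
                    )
            tokens.append(("CMD_EXIT", CMD_EXIT_TEXT))
            pos += len(CMD_EXIT_TEXT)
            continue
        ws = WS_RE.match(text, pos)
        if ws:
            pos = ws.end()  # whitespace is skipped, like the hidden channel
            continue
        ident = ID_RE.match(text, pos)
        if ident:
            tokens.append(("ID", ident.group()))
            pos = ident.end()
            continue
        raise ValueError("line 1:%d unexpected character %r" % (pos, text[pos]))
    return tokens
```

Calling `tokenize("COMMAND XYZ", commit_early=True)` raises an error at the 'X', while `tokenize("COMMAND XYZ", commit_early=False)` falls through to two ID tokens. The only difference between the two modes is how much of the literal is inspected before the tokenizer commits to CMD_EXIT.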