[antlr-interest] Lex Matching Issues

Mon Jul 19 09:19:34 PDT 2010

Greetings!
On Mon, 2010-07-19 at 09:52 -0600, Cid Dennis wrote:
> So I am new to ANTLR and have created a grammar but found a strange issue.  Because of the structure of the language I am parsing, there can be tokens that match reserved words but are used as variables, though only in a sub rule that does not use the reserved word.
> 
> In the example below "ruleset" is seen by the parser in two different ways.  The first is as the 'ruleset' token and the second is as a VAR token.  The problem is that when the parser sees the second ruleset it treats the token as the 'ruleset' token, not the VAR token, so it throws a MismatchedTokenException.
> 
> How can I make this kind of parsing work?  One workaround I came up with was to change 'ruleset' in the grammar to be a VAR, but then it is not easy to see what the grammar looks like.
> 
> In the end I do not care what the token is considered (VAR or 'ruleset') as long as the parser does the right thing and can parse the "assignment" if 'ruleset' is used on the left hand side of the assignment.
> 
> 
> Simple Example Input:
> 
> ruleset joe {
> 	rule myrulename is active {
> 		ruleset = "test";
> 	}	
> }
> 
> Simple Grammar:
> 
> grammar test;
> options {
>   output=AST;
> }
> 
> ID  :	('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_')*
>     ;
> 
> COMMENT
>     :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>     |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>     ;
> 
> WS  :   ( ' '
>         | '\t'
>         | '\r'
>         | '\n'
>         ) {$channel=HIDDEN;}
>     ;
> 
> STRING
>     :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>     ;
> 
> fragment
> EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
> 
> fragment
> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
> 
> fragment
> ESC_SEQ
>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>     |   UNICODE_ESC
>     |   OCTAL_ESC
>     ;
> 
> fragment
> OCTAL_ESC
>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7')
>     ;
> 
> fragment
> UNICODE_ESC
>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>     ;
>     
>     
> ruleset :	
> 	'ruleset' ID '{' rule* '}'
> 	;
> 	
> rule 	:
> 	'rule' ID 'is' ('active'|'inactive'|'test') '{' assignment* '}'
> 	;
> 
> 
> assignment :  
> 	ID '=' STRING ';'
> 	;
> 
> 	
> 

This is a fairly frequently asked question. Please try to search the
mail archives and/or the wiki at antlr.org.

One of the usual solutions, I believe, is to create a parser rule that
accepts your ID along with the keywords that are appropriate. So your
assignment rule would become something like (untested):

assignment : lhs '=' STRING ';' ;
lhs : ID | 'ruleset' /* other keyword alternatives go here */ ;

A downside to this approach is that one has to be very careful not to
introduce ambiguities, probably by having a different parser rule for
each context, which can get large and ugly...
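Applied to the grammar in the question, the first approach might look
like the following untested sketch. It assumes ANTLR 3 and reuses the
lhs rule from above; listing every keyword of the example grammar as an
alternative is my assumption about which keywords should be usable as
variable names, so trim the list as needed.

```
assignment
    :   lhs '=' STRING ';'
    ;

// Left-hand side of an assignment: a plain ID, or any keyword that
// should also be accepted as a variable name in this context.
lhs :   ID
    |   'ruleset'
    |   'rule'
    |   'is'
    |   'active'
    |   'inactive'
    |   'test'
    ;
```

The lexer still returns the keyword token for 'ruleset'; the parser
simply accepts that token wherever lhs is allowed, so the input
ruleset = "test"; should parse inside a rule body.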

Another solution is to not have any keywords in the lexer but to use
semantic predicates in the parser to identify the keywords. I do not
usually use predicates, so I do not remember the specific meta-syntax,
but for the Java target it would be something like:

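kw_ruleset : {input.LT(1).getText().equals("ruleset")}?=> ID ;
// and replace every 'ruleset' literal with a reference to the kw_ruleset
// rule (named so that it does not clash with the existing ruleset rule)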
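For the grammar in the question, the predicate approach could be
sketched roughly as below. This is untested and assumes ANTLR 3 with the
Java target, where input.LT(1) is the next token; the rule name
kw_ruleset is hypothetical. It uses the plain {...}? predicate form;
ANTLR 3 also has the gated {...}?=> form.

```
// Hypothetical helper rule: matches an ID token whose text is "ruleset".
kw_ruleset
    :   {input.LT(1).getText().equals("ruleset")}? ID
    ;

ruleset
    :   kw_ruleset ID '{' rule* '}'
    ;

// With no 'ruleset' literal left in the grammar, the lexer returns ID
// for it, so the original assignment rule can stay as it was.
assignment
    :   ID '=' STRING ';'
    ;
```

Only 'ruleset' is converted here; the other keywords ('rule', 'is',
'active', ...) would each need the same treatment before they could be
used as variable names.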

hope this helps...
   -jbb