[antlr-interest] Why does ANTLR generate code that will never call an OR'd alternative?

Sat Aug 21 07:24:24 PDT 2010

On 08/21/2010 04:00 AM, Avid Trober wrote:
> Kevin,
> 
> Thanks for taking the time to reply.  
> 
> I did have the predicate in the identifier rule, but it appears the wrong
> way:
> 
> 	identifier 
> 	:        {isToken(input.LT(1))}?  IDENTIFIER  | IDENTIFIER;

Why can't you do something like:

identifier: i:IDENTIFIER
        { if (isToken($i))
            { // code here for the isToken case
            }
          else
            { // code here (maybe empty) for the other case
            }
        }
        ;
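
A minimal sketch of that idea outside ANTLR, in plain Python: the rule always matches one IDENTIFIER and the keyword/non-keyword decision is made afterwards in ordinary code, not in a predicate. The `is_token` helper, the `KEYWORDS` set and the `(type, text)` tuples are hypothetical stand-ins, not ANTLR APIs, and the sketch assumes the lexer hands the word over with type IDENTIFIER in the first place.

```python
# Hypothetical stand-ins: tokens are (type, text) tuples and KEYWORDS plays
# the role of the grammar's tokens section.
KEYWORDS = {"TOKEN1", "TOKEN2", "TOKEN3"}


def is_token(token):
    """Return True if the token's text is one of the reserved words."""
    return token[1] in KEYWORDS


def identifier(tokens, pos):
    """Match exactly one IDENTIFIER, then branch on is_token in the action."""
    token = tokens[pos]
    if token[0] != "IDENTIFIER":
        raise SyntaxError("expected IDENTIFIER, got %s" % token[0])
    # The "action": decide after the match, so no prediction is involved.
    kind = "keyword-as-identifier" if is_token(token) else "plain-identifier"
    return kind, pos + 1
```

Because the branch runs after the token has been matched, there is only one alternative and nothing for the parser to predict.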

> The above still produced code that would never call isToken.  The reason I
> did it like above, I thought the predicate had to change the token type
> (from the tokens section value to IDENTIFIER); therefore, the IDENTIFIER
> after the predicate.
> 
> Per your email, I tried this:
> 
> 	identifier 
> 	:        {isToken(input.LT(1))}?  | IDENTIFIER;

The first alternative here is empty, so it won't match (consume) any
input.  In order for isToken to be called, the lookahead would have to
*not* be an IDENTIFIER.
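
As a rough illustration of what an empty, predicate-guarded alternative does at parse time, here is a simplified Python sketch of a rule shaped like `{pred}? | IDENTIFIER`. It ignores the follow-set lookahead that ANTLR's generated code also checks, and the names `identifier`, `pred` and the `(type, text)` tuples are hypothetical, not ANTLR APIs.

```python
def identifier(tokens, pos, pred):
    """Sketch of `{pred}? | IDENTIFIER`; returns the position after the rule.

    Alternative 1 is empty: if the predicate holds, nothing is consumed.
    Alternative 2 consumes one IDENTIFIER token.
    """
    token = tokens[pos]
    if pred(token):
        return pos          # empty alternative: the token is still pending
    if token[0] == "IDENTIFIER":
        return pos + 1      # matched and consumed IDENTIFIER
    raise SyntaxError("no viable alternative at %r" % (token,))


# Hypothetical input for `TOKEN1=identifier` with a keyword on the right side.
tokens = [("TOKEN1", "TOKEN1"), ("EQUALS", "="), ("TOKEN3", "TOKEN3"), ("EOF", "")]
after = identifier(tokens, 2, lambda t: t[0].startswith("TOKEN"))
```

Here the rule returns without advancing, so `tokens[after]` is still the keyword token, and whatever rule runs next has to deal with it.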

> And, ANTLR generated code that would call isToken.  But, isToken could also
> be called on the right side of the OR in the 'identifier' rule (see code
> below).
> But, worse:
> 
> 1. The identifier rule doesn't work in the above form.  I get unexpected
> token exceptions for using a tokens section token as what's meant to be
> non-grammar keywords.
> 
> 2. Check out this first "if" for a simple list of tokens.  Some checks are
> for the value of the token (e.g. TOKEN1, TOKEN10) and others are value
> range checks (e.g. LA30_0 >= TOKEN2 && LA30_0 <= TOKEN3).  The latter I
> could understand, if it weren't for the fact that the TOKEN2 and TOKEN3
> values are 5 and 6!
> 
> 
>             if ( (LA30_0 == TOKEN1
>                 || (LA30_0 >= TOKEN2 && LA30_0 <= TOKEN3)
>                 || (LA30_0 >= TOKEN4 && LA30_0 <= TOKEN5)
>                 || (LA30_0 >= TOKEN6 && LA30_0 <= TOKEN7)
>                 || (LA30_0 >= TOKEN8 && LA30_0 <= TOKEN9)
>                 || LA30_0 == TOKEN10
>                 || LA30_0 == TOKEN11
>                 || (LA30_0 >= TOKEN12 && LA30_0 <= TOKEN13)) )
>             {
>                 alt30 = 1;
>             }
>             else if ( (LA30_0 == IDENTIFIER) )
>             {
>                 int LA30_2 = input.LA(2);
> 
>                 if ( ((isToken(input.LT(1)))) )
>                 {
>                     alt30 = 1;
>                 }
>                 else if ( (true) )
>                 {
>                     alt30 = 2;
>                 }
>                 else 
>                 {
>                     NoViableAltException nvae_d30s2 =
>                         new NoViableAltException("", 30, 2, input);
> 
>                     throw nvae_d30s2;
>                 }
>             }
>             else 
>             {
>                 NoViableAltException nvae_d30s0 =
>                     new NoViableAltException("", 30, 0, input);
> 
>                 throw nvae_d30s0;
>             }
>             switch (alt30) 
>             {
>                 case 1 :
>                     // ... : {...}?
>                     {
>                     	root_0 = (object)adaptor.GetNilNode();
> 
>                     	if ( !((isToken(input.LT(1)))) ) 
>                     	{
>                           throw new FailedPredicateException(input,
>                               "identifier", "isToken(input.LT(1))");
>                     	}
> 
>                     }
>                     break;
>                 case 2 :
>                     // ... : IDENTIFIER
>                     {
>                     	root_0 = (object)adaptor.GetNilNode();
> 
>                       IDENTIFIER132=(IToken)Match(input,IDENTIFIER,
>                           FOLLOW_IDENTIFIER_in_identifier1562);
>                       IDENTIFIER132_tree = (object)adaptor.Create(IDENTIFIER132);
>                       adaptor.AddChild(root_0, IDENTIFIER132_tree);
> 
> 
>                     }
>                     break;
> 
>             }
> 
> 
> The only form of the 'identifier' rule I got to work was this:
> 
> 	identifier 
> 	:       
>     	  (      'TOKEN1' 
>     	  |      'TOKEN2'	
>     	  |      'TOKEN3'
> 		...
>     	  |      'TOKEN_ZILLION')   { input.LT(-1).Type = IDENTIFIER; }	
> 	  | 	  IDENTIFIER;
> 
> 
> Now, I can use a tokens keyword in a way the parser won't throw an
> exception:
> 
> 	TOKEN1=TOKEN3
> 
> 	And, 'TOKEN3' doesn't trip up the parser.
> (For the above, the rule is:
> 
> 	TOKEN1=identifier
> 
> Which never worked before if the right-side of the equal sign was a token in
> the tokens section).

In cases like this, I have done:

keyword : 'TOKEN1'
        | 'TOKEN2'
        | 'TOKEN3'
          ...
        | 'LAST_TOKEN'
        ;

identifier : IDENTIFIER
           | k:keyword
             { #k->setType(IDENTIFIER); }
           ;

(OK, this is with ANTLR 2.7.7 and the C++ target, but it should be
similar with ANTLR 3.)
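
To show the effect of retyping independently of any ANTLR version, here is a small hand-written Python sketch of the same technique. The `Token` class, the `KEYWORD_TYPES` set and both rule functions are hypothetical stand-ins, not ANTLR APIs; the point is only that the keyword list lives in one place and the token's type is overwritten once it is known to sit in identifier position.

```python
KEYWORD_TYPES = {"TOKEN1", "TOKEN2", "TOKEN3"}  # hypothetical keyword token types


class Token:
    def __init__(self, type_, text):
        self.type = type_
        self.text = text


def identifier(tokens, pos):
    """Accept IDENTIFIER or any keyword; retype a keyword as IDENTIFIER."""
    token = tokens[pos]
    if token.type in KEYWORD_TYPES:
        token.type = "IDENTIFIER"   # clobber the type, keep the text
    if token.type != "IDENTIFIER":
        raise SyntaxError("expected identifier, got %s" % token.type)
    return token, pos + 1


def assignment(tokens, pos=0):
    """Parse `TOKEN1 = identifier` and return the right-hand token."""
    if tokens[pos].type != "TOKEN1" or tokens[pos + 1].type != "EQUALS":
        raise SyntaxError("expected TOKEN1 '='")
    rhs, _ = identifier(tokens, pos + 2)
    return rhs


stream = [Token("TOKEN1", "TOKEN1"), Token("EQUALS", "="), Token("TOKEN3", "TOKEN3")]
rhs = assignment(stream)
```

After the call, `rhs.type` is `IDENTIFIER` while `rhs.text` is still `TOKEN3`, and the leading `TOKEN1` keeps its keyword type because it was never matched through `identifier`.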

> I don't like my solution, listing the tokens twice in the grammar file.
> And, would love to know how a pro would solve it.  Initially, should I
> (or must I) take all the tokens out of the tokens section and, perhaps,
> make per-token rules for them?

I wouldn't use a semantic predicate for this; rather, I'd just clobber
the token type when I knew it was an identifier and not a keyword.

This question comes up rather often on this list.

> Regards,
> Trober

-- 
Kevin J. Cummings
kjchome at rcn.com
cummings at kjchome.homeip.net
cummings at kjc386.framingham.ma.us
Registered Linux User #1232 (http://counter.li.org)