[antlr-interest] Why does ANTLR generate code that will never call an OR'd alternative?

Sat Aug 21 01:00:39 PDT 2010

Kevin,

Thanks for taking the time to reply.  

I did have the predicate in the identifier rule, but apparently in the wrong
form:

	identifier 
	:        {isToken(input.LT(1))}?  IDENTIFIER  | IDENTIFIER;

The above still produced code that would never call isToken.  I wrote it
that way because I thought the predicate had to change the token type (from
the tokens-section value to IDENTIFIER), hence the IDENTIFIER after the
predicate.

Per your email, I tried this:

	identifier 
	:        {isToken(input.LT(1))}?  | IDENTIFIER;

With that, ANTLR generated code that does call isToken.  But isToken can
also be called when the lookahead is IDENTIFIER, i.e. for the right side of
the OR in the 'identifier' rule (see code below).

Worse:

1. The identifier rule doesn't work in the above form.  I get unexpected
token exceptions when a token from the tokens section is used where it is
meant to be an ordinary identifier rather than a grammar keyword.

2. Check out this first "if" for a simple list of tokens: some checks are
for the value of the token (e.g. TOKEN1, TOKEN10) and others are range
checks (e.g. (LA30_0 >= TOKEN2 && LA30_0 <= TOKEN3)).  The latter I could
understand, if it weren't for the fact that the TOKEN2 and TOKEN3 values
are 5 and 6!

            if ( (LA30_0 == TOKEN1
                  || (LA30_0 >= TOKEN2 && LA30_0 <= TOKEN3)
                  || (LA30_0 >= TOKEN4 && LA30_0 <= TOKEN5)
                  || (LA30_0 >= TOKEN6 && LA30_0 <= TOKEN7)
                  || (LA30_0 >= TOKEN8 && LA30_0 <= TOKEN9)
                  || LA30_0 == TOKEN10
                  || LA30_0 == TOKEN11
                  || (LA30_0 >= TOKEN12 && LA30_0 <= TOKEN13)) )
            {
                alt30 = 1;
            }
            else if ( (LA30_0 == IDENTIFIER) )
            {
                int LA30_2 = input.LA(2);

                if ( ((isToken(input.LT(1)))) )
                {
                    alt30 = 1;
                }
                else if ( (true) )
                {
                    alt30 = 2;
                }
                else 
                {
                    NoViableAltException nvae_d30s2 =
                        new NoViableAltException("", 30, 2, input);

                    throw nvae_d30s2;
                }
            }
            else 
            {
                NoViableAltException nvae_d30s0 =
                    new NoViableAltException("", 30, 0, input);

                throw nvae_d30s0;
            }
            switch (alt30) 
            {
                case 1 :
                    // ... : {...}?
                    {
                    	root_0 = (object)adaptor.GetNilNode();

                    	if ( !((isToken(input.LT(1)))) ) 
                    	{
                    	    throw new FailedPredicateException(input,
                    	        "identifier", "isToken(input.LT(1))");
                    	}

                    }
                    break;
                case 2 :
                    // ... : IDENTIFIER
                    {
                    	root_0 = (object)adaptor.GetNilNode();

                    	IDENTIFIER132=(IToken)Match(input,IDENTIFIER,FOLLOW_IDENTIFIER_in_identifier1562); 
                    		IDENTIFIER132_tree = (object)adaptor.Create(IDENTIFIER132);
                    		adaptor.AddChild(root_0, IDENTIFIER132_tree);

                    }
                    break;

            }
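
On the range checks: a range test over token types 5 and 6 accepts exactly
the same values as two equality tests, so the generator may simply be
merging adjacent token types into intervals.  A minimal Python sketch of
that idea follows.  It is not ANTLR's actual code; the function names and
every token number other than 5 and 6 are made up for illustration.

```python
def collapse_to_ranges(token_types):
    """Merge integer token types into sorted, inclusive (lo, hi) ranges."""
    ranges = []
    for t in sorted(set(token_types)):
        if ranges and t == ranges[-1][1] + 1:
            # t is adjacent to the previous range, so extend that range
            ranges[-1] = (ranges[-1][0], t)
        else:
            ranges.append((t, t))
    return ranges


def matches(la, ranges):
    """Range-style test, like (LA30_0 >= TOKEN2 && LA30_0 <= TOKEN3)."""
    return any(lo <= la <= hi for lo, hi in ranges)


# Hypothetical token type numbers; only 5 and 6 come from the message above.
keyword_types = {5, 6, 9, 10, 13}
ranges = collapse_to_ranges(keyword_types)
```

Under these assumptions the range form and the plain set-membership form
accept the same lookahead values, so the mix of "==" and ">= ... <=" checks
would only be a more compact spelling of the same token list.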

The only form of the 'identifier' rule I got to work was this:

	identifier 
	:       
    	  (      'TOKEN1' 
    	  |      'TOKEN2'	
    	  |      'TOKEN3'
		...
    	  |      'TOKEN_ZILLION')   { input.LT(-1).Type = IDENTIFIER; }	
	  | 	  IDENTIFIER;

Now I can use a tokens-section keyword where an identifier is expected
without the parser throwing an exception:

	TOKEN1=TOKEN3

Here 'TOKEN3' doesn't trip up the parser.  For the above, the rule is:

	TOKEN1=identifier

which never worked before if the right side of the equals sign was a token
in the tokens section.
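
To make the effect of the working rule concrete, here is a rough Python
model of it: accept either a keyword or an IDENTIFIER, and retype a keyword
to IDENTIFIER once it has been consumed.  The Token class, the function
names and all token numbers are hypothetical; this is only a sketch of the
behaviour, not generated parser code.

```python
from dataclasses import dataclass

# Hypothetical token type numbers, for illustration only.
IDENTIFIER, EQUALS, TOKEN1, TOKEN3 = 4, 5, 6, 8
KEYWORD_TYPES = {TOKEN1, TOKEN3}


@dataclass
class Token:
    type: int
    text: str


def match_identifier(tokens, pos):
    """Accept a keyword or an IDENTIFIER at pos; return the next position."""
    tok = tokens[pos]
    if tok.type in KEYWORD_TYPES:
        # Mirrors { input.LT(-1).Type = IDENTIFIER; }: retype the keyword.
        tok.type = IDENTIFIER
    elif tok.type != IDENTIFIER:
        raise ValueError(f"unexpected token {tok.text!r}")
    return pos + 1


def match_assignment(tokens):
    """Model of the rule TOKEN1=identifier."""
    if tokens[0].type != TOKEN1 or tokens[1].type != EQUALS:
        raise ValueError("expected TOKEN1 '='")
    return match_identifier(tokens, 2)


# The input TOKEN1=TOKEN3 from the example above.
tokens = [Token(TOKEN1, "TOKEN1"), Token(EQUALS, "="), Token(TOKEN3, "TOKEN3")]
end = match_assignment(tokens)
```

In this model the keyword set lives in one place (KEYWORD_TYPES), which is
the property I am after in the grammar itself.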

I don't like my solution, since it lists the tokens twice in the grammar
file, and I would love to know how a pro would solve it.  To start with,
should (or must) I take all the tokens out of the tokens section and,
perhaps, make per-token rules for them?

Regards,
Trober

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Kevin J. Cummings
Sent: Saturday, August 21, 2010 2:42 AM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Why does ANTLR generate code that will never
call an OR'd alternative?

On 08/21/2010 03:27 AM, Avid Trober wrote:
> Gerald,
> 
> Thank you very much for your reply.
> 
> There's no alt skipped message in the error log.
> 
> The 'isToken' rule was simply my attempt to have the parser check if the
> token was in the tokens { ... } section.  At runtime, I found the token
> type to always be the value in the tokens { ... } section, even if I tried
> to change it:
> 
> 	isToken	:	{isToken(input.LT(1))}? IDENTIFIER;
> 
> But, 'isToken' would never get called via the generated code, e.g. 
> 
> 	identifier  :  isToken | IDENTIFIER;   // i.e. treat a token in the tokens section as an IDENTIFIER.

You need to move your semantic predicate.  The lookahead sees that
IDENTIFIER is the lookahead for both alternatives.  If you want it to go
through isToken, you need to move the semantic predicate to the "identifier"
rule.
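
To illustrate, here is a rough Python model of the decision ANTLR builds for
"identifier : isToken | IDENTIFIER", based on the generated C# quoted
further down.  The function name and the token numbers are made up; it is a
sketch of the logic, not ANTLR's own code.

```python
# Hypothetical token type numbers, for illustration only.
IDENTIFIER = 4
TOKEN1 = 5


class NoViableAlt(Exception):
    """Stand-in for the generated NoViableAltException."""


def predict_identifier_alt(la1_type, is_token):
    """Pick an alternative for: identifier : isToken | IDENTIFIER,
    where isToken is: {isToken(input.LT(1))}? IDENTIFIER.

    Both alternatives begin with IDENTIFIER, so the token type is tested
    first and the predicate only chooses between them afterwards.
    """
    if la1_type == IDENTIFIER:
        return 1 if is_token else 2
    # Any other token type cannot start either alternative.
    raise NoViableAlt(la1_type)
```

With this shape, a keyword token such as TOKEN1 never reaches the predicate,
whatever isToken would have returned for it.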

> Therefore, I modified my 'identifier' rule to have each tokens { ... }
> value in it, e.g.
> 
> 	identifier:
> 		( 'TOKEN1' | 'TOKEN2' | ... | 'TOKEN_ELEVENTYTEEN_THOUSAND' )  { input.LT(-1).Type = IDENTIFIER; }
> 		| IDENTIFIER;
> 
> And,  that worked.  That is, if I have "identifier" in the grammar
> somewhere it will now accept an IDENTIFIER, as it always has, but also any
> 'TOKEN1', 'TOKEN2', etc. value found in tokens { ... }
> 
> Personally, I hate this.  It means I need *two* places in my grammar to
> list the keywords, the tokens { ... } section AND the identifier rule.
> I'm sure there's some way to do it via an action, predicate, whatever.  
> 
> I went down this path due to this recommendation: "The author's
> recommendation is to use ordinary rules and the tokens command." at
> http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required
> 
> It appears the tokens section is NOT the thing to do, and that it is
> perhaps better to have per-token rules, e.g. keyToken1, keyToken2, etc.
> But I can't rewrite this grammar and risk breaking other things.  Perhaps
> I should in the future.  Preferably, I'd simply like a way to scan through
> the tokens and, if one is found, note it, then change the token type to
> IDENTIFIER - without listing all the tokens twice in the grammar.
> 
> Any suggestions very, very welcome. 
> 
> Regards,
> Trober
> 
> 
> 
> 
> -----Original Message-----
> From: Gerald Rosenberg [mailto:gerald at certiv.net] 
> Sent: Saturday, August 21, 2010 1:35 AM
> To: Avid Trober
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Why does ANTLR generate code that will never
> call an OR'd alternative?
> 
>   Most likely, the parser generation analysis determined that isToken 
> can never be reached.  Check your error log for an alt skipped message.
> 
> 
> 
> ------ Original Message (Saturday, August 21, 2010 1:01:20 AM) From: Avid Trober ------
> Subject: [antlr-interest] Why does ANTLR generate code that will never
> call an OR'd alternative?
>> For this rule,
>>
>> identifier
>>                  :       isToken | IDENTIFIER;
>>
>> ANTLR generates code that would never call the isToken rule
>> (target=CSharp2):
>>
>>      public MYParser.identifier_return identifier()    // throws RecognitionException [1]
>>      {
>> .
>>              // .  : ( isToken | IDENTIFIER )
>>              int alt30 = 2;
>>              int LA30_0 = input.LA(1);
>>
>>              if ( (LA30_0 == IDENTIFIER) )   //<== token must be IDENTIFIER to call isToken???
>>              {
>>                  int LA30_1 = input.LA(2);
>>
>>                  if ( ((isToken(input.LT(1)))) )  //<== why must LA30_0 == IDENTIFIER to call isToken?
>>                  {
>>                      alt30 = 1;
>>                  }
>>                  else if ( (true) )
>>                  {
>>                      alt30 = 2;
>>                  }
>> .
>>              else                         //<== since not IDENTIFIER, why not call isToken here???
>>              {
>>                  NoViableAltException nvae_d30s0 =
>>                      new NoViableAltException("", 30, 0, input);
>>
>>                  throw nvae_d30s0;
>>              }
>>
>> I would think it's something to do with DFA optimization?  Perhaps that's
>> why IDENTIFIER is checked first.
>>
>> But, if IDENTIFIER is false, why not call isToken???  After all, the rule
>> is IDENTIFIER  ****OR***** isToken.
>>
>> Thanks,
>> Trober
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
> 
> 

-- 
Kevin J. Cummings
kjchome at rcn.com
cummings at kjchome.homeip.net
cummings at kjc386.framingham.ma.us
Registered Linux User #1232 (http://counter.li.org)
