[antlr-interest] Ignoring comments in predicates problem

Mon Jan 31 14:05:11 PST 2005

On Jan 30, 2005, at 11:48 AM, Paul J. Lucas wrote:
> Given:
>
>     protected Ignore
>         :   (   WhiteSpaceChar
>             |   "(:" ( options { greedy = false; } : . )* ":)"
>             )+
>         ;
>
>     protected Keywords
>         :   // ...
>         |   (Identifier (Ignore)? '(' ~':')=> Identifier {
>                 $setType( FUNCTION_NAME );
>             }
>         ;
>
> That is "Ignore" is used in predicates to ignore either whitespace or
> comments -- a comment in XQuery is (: like this :)
>
> I do get a "nongreedy block may exit incorrectly due to limitations of
> linear approximate lookahead" warning for "Ignore".

Hi Paul, I believe in this case the warning is overly cautious.  As long 
as the follow set of the nongreedy loop is exactly a single sequence of 
chars (here ":)"), it will always work.
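
A hand-written equivalent may make it clearer why the nongreedy loop is 
safe here.  This is only a sketch in plain Java, not ANTLR-generated 
code; the names CommentScan and matchComment are made up for 
illustration, and nested comments are not handled.  The loop consumes 
any character until the exact two-character sequence ":)" shows up, 
which is the single follow sequence of the ( . )* block:

```java
class CommentScan {
    /**
     * Returns the index just past the "(: ... :)" comment starting at pos,
     * or -1 if the input at pos is not a complete comment.
     * Hypothetical helper for illustration only.
     */
    static int matchComment(String s, int pos) {
        if (!s.startsWith("(:", pos)) {
            return -1; // not a comment start, nothing consumed
        }
        int i = pos + 2;
        // Nongreedy loop: exit as soon as the follow sequence ":)" is seen.
        while (i < s.length()) {
            if (s.startsWith(":)", i)) {
                return i + 2;
            }
            i++; // the "." in the grammar: consume any one character
        }
        return -1; // unterminated comment
    }

    public static void main(String[] args) {
        System.out.println(matchComment("(: comment :) (", 0));
        System.out.println(matchComment("( x", 0));
    }
}
```

Because the exit test is a single fixed sequence, there is no second 
path the loop could confuse it with.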

> If I have an "Identifier" optionally followed by "Ignore" followed by
> '(' but not followed by a ':', then I have a function name.  I want to
> handle all the cases of:
>
>     foo( ...
>     foo ( ...
>     foo (: comment :) ( ...
>
> That is allow zero or more whitespaces or comments in between the
> Identifier and the '('.  The second case above doesn't work.

I assume that the "foo (" is indeed not followed by a ':'.
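
To pin down the behavior being asked for, here is a sketch of the same 
check written by hand in Java.  It is not what ANTLR generates; the 
names FunctionNameCheck, skipIgnore and looksLikeFunctionCall are 
hypothetical.  It skips whitespace and complete comments, treats a lone 
'(' as the end of the skipping rather than as an error, and then 
requires '(' followed by something other than ':':

```java
class FunctionNameCheck {
    /** Skips whitespace and complete "(: ... :)" comments starting at pos. */
    static int skipIgnore(String s, int pos) {
        int i = pos;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                i++;
            } else if (s.startsWith("(:", i)) {
                int end = s.indexOf(":)", i + 2);
                if (end < 0) {
                    break; // unterminated comment: stop skipping
                }
                i = end + 2;
            } else {
                break; // a lone '(' ends the skipping instead of failing
            }
        }
        return i;
    }

    /** True if the text starting at identEnd has the shape: Ignore? '(' ~':' */
    static boolean looksLikeFunctionCall(String s, int identEnd) {
        int i = skipIgnore(s, identEnd);
        return i + 1 < s.length()
                && s.charAt(i) == '('
                && s.charAt(i + 1) != ':';
    }

    public static void main(String[] args) {
        // The three cases from the original post; "foo" ends at index 3.
        System.out.println(looksLikeFunctionCall("foo( x", 3));
        System.out.println(looksLikeFunctionCall("foo ( x", 3));
        System.out.println(looksLikeFunctionCall("foo (: comment :) ( x", 3));
    }
}
```

The important detail is the two-character test before committing to the 
comment branch, which is what the second case needs.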

> For the ANTLR-generated code for "Ignore" I get in part:
>
>     switch ( LA(1)) {
>     case '\t':  case '\n':  case '\r':  case ' ':
>     {
>         mWhiteSpaceChar(false);
>         break;
>     }
>     case '(':
>     {
>         match("(:");
>
> The execution enters the '(' case above, but then match() throws a
> RecognitionException because it doesn't match "(:".  Back in the
> "Keywords" ANTLR-generated code, it's:
>
>     try {
>         mIdentifier(false);
>         if ((_tokenSet_6.member(LA(1))) && (_tokenSet_7.member(LA(2)))) {
>             mIgnore(false);
>         }
>         else if ((LA(1)=='(') && (_tokenSet_8.member(LA(2)))) {
>         }
>         else {
>             throw new NoViableAltForCharException((char)LA(1),
>                 getFilename(), getLine(), getColumn());
>         }
>         match('(');
>         matchNot(':');
>     }
>     catch (RecognitionException pe) {
>         synPredMatched255 = false;
>     }
>
> What I *want* to happen is for execution to pick up at the "else if"
> above, but since mIgnore throws a RecognitionException, it jumps to the
> "catch" which is *not* what I want.
>
> It seems to me that the ANTLR-generated code for "Ignore" should *not*
> throw a RecognitionException for my second case.  Why doesn't the
> generated code explicitly check for ':' after '(' and if the character
> is *not* ':' simply exit?

Hmm...this is odd.  I see you have k>=2.  It should not enter Ignore if 
there is no "(:".  Can you tell me what the _tokenSet_7 set looks like 
in:

>         if ((_tokenSet_6.member(LA(1))) && (_tokenSet_7.member(LA(2)))) {
>             mIgnore(false);
>         }
>         else if ((LA(1)=='(') && (_tokenSet_8.member(LA(2)))) {
>         }

It should not enter this first "if" and should go on to the "else if".  
If you set the codeGenBitsetTestThreshold option to a big number, the 
generated code should list the chars it's testing for LA(2) explicitly 
instead of hiding them in a bitset.

I think that is our key.
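
For background on why the LA(2) set matters: linear approximate 
lookahead keeps one character set per lookahead depth rather than the 
set of valid two-character sequences, so a decision can accept a pair 
whose characters each appear at the right depth but never together.  
The Java sketch below illustrates this with made-up sequences; the 
real contents of _tokenSet_6 and _tokenSet_7 are still unknown, so 
everything in SEQUENCES is an assumption, as is the class name:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LinearApproxDemo {
    // Assumed two-character starts of Ignore (hypothetical, illustration only).
    static final List<String> SEQUENCES = Arrays.asList("  ", " (", "(:");

    /** Full lookahead: the pair must be one of the listed sequences. */
    static boolean exact(char la1, char la2) {
        return SEQUENCES.contains("" + la1 + la2);
    }

    /** Linear approximation: one independent character set per depth. */
    static boolean approximate(char la1, char la2) {
        Set<Character> depth1 = new HashSet<>();
        Set<Character> depth2 = new HashSet<>();
        for (String seq : SEQUENCES) {
            depth1.add(seq.charAt(0));
            depth2.add(seq.charAt(1));
        }
        return depth1.contains(la1) && depth2.contains(la2);
    }

    public static void main(String[] args) {
        // "((" is not a listed sequence, yet both depth tests pass.
        System.out.println(exact('(', '(') + " " + approximate('(', '('));
        // "(:" is accepted by both.
        System.out.println(exact('(', ':') + " " + approximate('(', ':'));
    }
}
```

Whether anything like this is happening in the lexer at hand depends on 
what the generated sets actually contain.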

> How can I get what I want?

You could swat the fly with a hammer (read that "hack" it) by adding a 
semantic predicate:

         |   (Identifier
                 ({(LA(1)=='(' && LA(2)==':') || (is whitespace)}? Ignore)?
                 '(' ~':')=> ...

Shouldn't be necessary though...let's explore the lookahead set.

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com