[antlr-interest] PEG-style "and" predicates

Tue Jul 3 00:52:52 PDT 2007

El 3/7/2007, a las 2:31, Richard Clark escribió:

> It took careful reading of all the available material, but here's what
> I know about predicates in v3 (coming from an old hand at v2):
>
> 1. Syntactic predicates (of the form (FOO) => foo) are used to remove
> the ambiguity among a set of rules. **If a static analysis shows no
> ambiguity, the syntactic predicate's code is never generated.** These
> may be "hoisted" out of the defining rule to wherever needed.
>
> 2. Semantic predicates ( {input.LA(1) == FOO} => foo ) are the same as
> syntatic predicates except you get to specify arbitrary code. They too
> won't be compiled in if there's no ambiguity in the grammar and may be
> hoisted to where needed.
>
> 3. Gated semantic predicates ( { myHelper(); }? => foo ) are always
> compiled in. They're normally used when outside forces change how the
> grammar is handled (e.g. command line options.)

Your comments on 1 and 2 match my understanding. I didn't actually  
know that flavor 2 of semantic predicates existed! That is, I only  
knew about these forms:

Syntactic predicate: (foo)=> bar

Validating semantic predicate: foo {bar}?

Gated semantic predicate: {foo}?=> bar

Disambiguating semantic predicate {foo}? bar

> I was surprised you're doing this in the lexer. It seems that if the
> token changes meaning depending on its position in the line, it's the
> parser's job to determine such things. (In my opinion, your mileage
> may vary, void where prohibited, etc. etc. etc.)

The impression I've gotten on the list is that "the ANTLR way" is to  
generally keep the lexer as simple as possible and do as much work in  
the parser. In this case, however, I'm using a filtering lexer to  
process wikitext and there are some symbols which only have special  
meaning at certain places on the line (basically, first thing or last  
thing on the line). For example, a level-2 heading looks like this:

== foo ==

In this particular case it seems much simpler to let the lexer handle  
this than the parser (which I am writing by hand because for this  
language, it seems that ANTLR isn't the right tool for the job);  
basically, I don't want the parser to have to care about whitespace  
and position on the line etc: it just acts as a simple translator  
acting on a stream of tokens. That way I can have input like the  
following:

== Does foo == bar? ==

And the lexer will do the right thing, tokenizing this as H2_START,  
CHARS, H2_END. The parser duly transforms this to:

<h2>Does foo == bar?</h2>

El 3/7/2007, a las 2:55, Richard Clark escribió:
> BTW. the fate of predicates is a little ambiguous (so to speak). Page
> 318 of the ANTLR hymnal says this about semantic predicates:
>
>> For efficiency reasons, ANTLR hoists predicates
>> into DFAs only when LL(*) lookahead alone is
>> insufficient to distinguish alternatives.
>
> Yet on page 333:
>
>> To be precise, manually specified syntactic
>> predicates become gated semantic predicates.
>> These syntactic predicates are always evaluated
>> just like gated semantic predicates.
>>
>> On the other hand, syntactic predicates implicitly
>> added by auto-back-tracking mode become regular
>> semantic predicates. Such predicates are evaluated
>> only when LL(*) fails to predict an alternative.
>
> So, I believe my point #2 was wrong -- if you add it by hand (as you
> did), it should always be evaluated.
>
> Thus we have a mystery.

Well, to clarify, it's not that it isn't being evaluated; looking at  
the generated code I can see it isn't even being generated! For  
example, look at the following rules:

   fragment H6     : '======' ;
   H6_END          : (H6 ' '* ( '\n' | '\r' | EOF))=> H6 ;

The intention is to match "======" as H6_END only if it is the last  
non-whitespace thing on the line.

This is a snippet of the generated mTokens() function; you can see  
that if a "=" is seen then a syntactic predicate is checked to see  
whether or not our rule should be matched:

   if ((LA19_0 == '='))
   {
       {
           int LA19_7 = LA(2);
           if ( (synpred20(ctx)) )
           {
               alt19=18; // It is I
           }

The first thing to note is that the LA19_7 variable which is  
initialized with LA(2) is never referenced after initialization. The  
other thing to note is how the syntactic predicate gets generated:

static ANTLR3_BOOLEAN synpred20(pWikiTextLexer ctx)
{
     ANTLR3_UINT64   start;
     ANTLR3_BOOLEAN  success;

     BACKTRACKING++;
     start	= MARK();
     synpred20_fragment(ctx);	    // can never throw exception
     success	= !(FAILEDFLAG);
     REWIND(start);
     BACKTRACKING--;
     FAILEDFLAG	= ANTLR3_FALSE;
     return success;
}

This in turn calls the corresponding fragment function:

static void synpred20_fragment(pWikiTextLexer ctx )
{
     {
         mH6_END(ctx );
         if (HASFAILED())
         {
             return ;
         }
     }

     goto rulesynpred20Ex; /* Prevent compiler warnings */
     rulesynpred20Ex: ;
}

So this predicate effectively does the following:

- Asks, can I match FOO?
- Try matching FOO
- If succeeds, says, Yes, you can match FOO: proceeds to match FOO  
all over again

I am not actually sure why ANTLR is doing this as it doesn't seem  
very efficient; looking at the mTokens() function almost every single  
prediction employs the same pattern, and only the last alternative in  
each group doesn't use a predicate.

Then, looking at the generated mH6_END method you can see that the  
the actual contents of the syntactic predicate as specified in the  
grammar don't make it into the generated code there either:

static ANTLR3_INLINE
void mH6_END(pWikiTextLexer ctx)
{
     ANTLR3_UINT32 _type;
     _type = H6_END;

     // WikiText.g:69:19: ( ( H6 ( ' ' )* ( '\\n' | '\\r' | EOF ) )=>  
H6 )
     // WikiText.g:69:19: ( H6 ( ' ' )* ( '\\n' | '\\r' | EOF ) )=> H6
     {
         /* 69:19: ( H6 ( ' ' )* ( '\\n' | '\\r' | EOF ) )=> H6 */
         mH6(ctx );
         if (HASFAILED())
         {
             return;
         }
     }
     LEXER->type = _type;

     goto ruleH6_ENDEx; /* Prevent compiler warnings */
     ruleH6_ENDEx: ;
}

Out of interest, I tried altering the grammar in a number of ways to  
see if the desired syntactic predicate code was ever generated:

- I tried using a string literal rather than a fragment rule: no  
difference.

- I tried adding an arbitrary alternative to the rule: at last the  
syntactic predicate got generated and add to the mH6_END function; so  
this confirms the hypothesis that syntactic predicates are ignored in  
rules which don't have multiple alternatives.

- I tried turning filtering off: the prediction logic in the mTokens  
method was different, but the syntactic predicate was still not  
generated.

In any case, I went ahead and used the alternative proposed by Jim  
Idle in another post on this thread:

   H6_END : H6 { MARK(); } ' '* ( '\n' | '\r' | EOF ) { REWINDLAST 
(); } ;

It seems the cleanest, most reusable solution.

Cheers,
Wincent