[antlr-interest] semantic predicate hoisting :)

Tue Jun 8 11:48:03 PDT 2004

On Jun 8, 2004, at 10:10 AM, John D. Mitchell wrote:

>>>>>> "Terence" == Terence Parr <parrt at cs.usfca.edu> writes:
> [...]
>
>> Hooray!  I have an initial implementation of the combined LL(*) and
>> semantic predicate hoisting mechanism.
>
> YEA!!!
>
> Do you happen to have a proof of correctness for this hoisting yet?

Define "correct"! ;)  Good question.  If I can define hoisting 
properly, I should be able to demonstrate my algorithm implements it.  
I plan on an LL(*) paper first as hoisting cannot be done properly 
without the syntactic context.

>> The requirements are fairly nasty to handle difficult situations
>> naturally, but I think my implementation covers them all nicely as
>> opposed to the ad hoc hacking I did for PCCTS (uh...and ANTLR <3.0) ;)
>
> How does it handle:
>
> 	* deeply nested semantic predicates?

It's beautiful.  Once a grammar is properly encoded as an NFA, the 
LL(*)+hoisting algorithm is a variant of the subset construction 
algorithm for DFA conversion (a few Terence tricks in there <wink>).  
The hoisting of predicates is done by collecting all predicates 
together as they are found when walking the NFA.  For example,

a : {p1}? b ;
b : {p2}? ID ;

would AND together p1 and p2 as it walked downwards towards the ID 
token.  This is the "semantic context" that gets carried along like a 
wavefront.  Upon syntactic conflict, the semantic context is checked.  
If sufficient, it is used to resolve stuff.

Now, when multiple predicates in the NFA get you to the same DFA state, 
you need to OR stuff together:

a : b | ID ;
b : {p1}? ID | {p2}? ID ;

Here, you will get a DFA state with NFA configurations inside for both 
p1 and p2 as well as the "naked" alt for 2nd alt of rule a.  To resolve 
syntactic conflicts, the DFA converter combines all predicates within 
the same DFA state for the same alt with an OR condition: p1||p2 
predicts alt 1 of 'a' and !(p1||p2) predicts alt 2.

The beauty of this is that once the NFA is created properly, the simple 
DFA converter will handle any situation indicated by the grammar.  
There is no longer an ad hoc set of rules about what to do upon (..)+ 
then (...)* etc...

Note that I am not allowing a hoisting distance > 0.  This means that 
predicates seen after at least one token cannot be used to disambiguate 
decisions.  For example, the following rule is syntactically ambig and 
cannot be resolved with predicates:

a : ID {p1}?
    | ID {p2}?
   ;

> 	* recursive rules (which contain semantic predicates)?

Here's one of my unit tests:

     public void testLeftRecursivePred() throws Exception {
         Grammar g = new Grammar(
             "grammar P;\n"+
             "a : {p1}? a | ID ;\n");
         g.createLookaheadDFAs();
         String expecting =
                 ".s0-ID->.s1\n" +
                 ".s1-{p1}?->:s2=>1\n" +
                 ".s1-{!(p1)}?->:s3=>2\n";
         checkDecision(g, 1, expecting, null);
     }

Normally left-recursion is verboten, but with a predicate, it knows how 
to resolve the conflict (ID predicts both alts 1 and 2).  The state 
machine is "match ID then if p1, predict alt 1.  Else if !p1 predict 
alt 2.

The great thing about the algorithm is that it works with whatever NFA 
you can build.  A left-recursive NFA is no biggie. :)

>> I have enclosed a small 16k PDF file showing the DFA ANTLR 3.0 core
>> created to predicate the various alternatives of rule a.  Note that 
>> the
>> DFA edges may be tokens (or characters) or predicates unlike a normal
>> DFA.
>
> How does this interact with syntactic predicates?

I haven't figured that out yet ;)  I'm assuming that the two will be 
mutually orthogonal.

>> Prediction is no longer a series of tests, but rather the exact alt to
>> predict is discovered with one shot.
>
> Please elaborate.  You're still, in essence, traversing the (LL*) DFA 
> but
> in an exact fashion rather than via approximations, right?

Well, yes, no more approx lookahead (except as an optimization).  What 
I mean here though is that previously I would do an if-then-else for 
this rule:

a : A B | A C | A D | B E | B F ;

Now, I will only compare two symbols of lookahead as a normal DFA would 
do to "search" for the matching string.  A series of LA(1)==A && 
LA(2)==B type tests is O(2m) for m alts whereas this DFA is actually 
O(2) (i.e., insensitive to the number of alternatives).  yehaaaa!

>> Oh, the i=>alt notation means "DFA state i predicts alternative alt". 
>> :)
>
> You might want to qualify the alt with the rule name to make that 
> clear.

Yeah, that's just a serialization thing I do and is specific to each 
decision state...

>> Wooohoooo!  NOW, can I go home, John? (Mitchell's been bugging me for 
>> 9
>> years to get hoisting back) <snicker>
>
> Not until you clean your plate, mister!  :-)

Almost clean ;)  I'm refactoring the code now since I have about 50 
unit tests for the convertor.  I want to make the code more readable 
now that it works ;)  Comments are up to date as well.

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/