[antlr-interest] the nihilistic circle hoist

Wed Jan 5 13:02:36 PST 2011

> First and foremost, I'm treating the rule
> "args" as though it were written "args : ID+;"

Fine in theory. In practice, it conflates two distinct bugs
(which I say because the resulting parser rejects the
entire universe of possible input strings, which is quite
different behavior, and X X* should never behave
differently than X+, to my way of thinking).

> Problem 1: The rule "args" is not always executed behind the same predicate.
> This is easily corrected by adding a second rule "args2"

I agree you could always work around the hoisting bug by
rewriting your grammar. That's why I allow it might
be possible to implement hoisting correctly via grammar
transformations.

> This condition *must* be checked as part of the loop condition in the rule
> "args".

Not quite. To implement hoisting correctly, the predicate *must*
be fired when "args" is invoked via derivation from "optional"
and must *not* be fired when "args" is invoked via derivation
from "mainprog". This is what is broken. Antlr is, so to say,
combining states that cannot be combined (which is why
rewriting the grammar to split one nonterminal into
two works -- it breaks apart states that ANTLR
is erroneously combining, which can arise when a predicate is
hoisted into a nullable nonterminal via its FOLLOW set).

> Solution: Ignoring the only truly acceptable solution of not using syntactic
> predicates, you should add the following to my list of rules about semantic
> predicates.

I'm not sure how that translates into a generic rule, but if it's
"don't use a sympred if removing it causes a warning", then
it seems like the new list of constraints has eliminated virtually
all practical uses. Presumably, you shed no tears. :-)

Yet another way to look at why this is a fundamental
implementation flaw is to use the predicate hoisting
bug to inject two unrelated predicates into
a third unrelated nonterminal:

grammar hoist1;

WS  : (' '|'\t'|'\n'|'\r')+ {skip();};
ID  : ('a'..'z')+;

start : mainprog optional? EOF;

mainprog : '(' args ')';
args : ID ID*;

optional: ':'  args element ':' args element2;
element : {true}?=>args2;
element2: {false}?=>args3;
args2 : ID ID*;
args3 : ID ID* ;

This will inject two different unrelated predicates into "args".
When you look at the resulting code for args,
well, it's a bit bonkers:

    public final void args() throws RecognitionException {
        try {
            // hoist1.g:11:6: ( ID ( ID )* )
            // hoist1.g:11:8: ID ( ID )*
            {
            match(input,ID,FOLLOW_ID_in_args69);
            // hoist1.g:11:11: ( ID )*
            loop2:
            do {
                int alt2=2;
                int LA2_0 = input.LA(1);

                if ( (LA2_0==ID) ) {
                    int LA2_2 = input.LA(2);

                    if ( (LA2_2==ID) &&
((!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||(false)||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||(true)||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false))))||!((((true)||(false)))))))
{
                        int LA2_3 = input.LA(3);

                        if ( (!((((false)||(true))))) ) {
                            alt2=1;
                        }

But my point here is that the algorithm clearly thinks it should be
calculating a Boolean expression that is based on two unrelated
predicates. It makes no sense to be combining unrelated predicates
into a single decision. (note that element and element2 are siblings
and have no part of their parse trees in common -- their predicates
are not relatable to each other).

For hoisting to work correctly in these cases,
the code would have to be distinguishing at runtime which of the
(arbitrarily many) unrelated predicates (or none of them!) is
"active" at that moment in the parse. Or else implicitly rewrite
the grammar beforehand to avoid the problem.