[antlr-interest] the nihilistic circle hoist

Sun Dec 26 15:38:03 PST 2010

I ran across this puzzling case from the archives:
http://www.antlr.org/pipermail/antlr-interest/2007-June/021483.html

grammar Simple;

FOO : 'foo' ;

section : element* EOF ;
element : {true}?=> pre ;
pre : FOO+ ;

Which resulted in the predicate being "hoisted" into "pre",
apparently with the result that the grammar can't match
any input.

Note that this grammar is unresolvably ambiguous, as is
always the case when you have X* and X derives a
Y*. The example is derived from a more complex case,
but either way (and however one might question the
desirability), one could argue that the predicate
lets you resolve the ambiguity by saying "if there is
at least a single 'FOO', then let 'pre' eat them all
up, resulting in a single 'element' (despite the * on
the 'element').

The generated parser
cannot accept the intended language, and this fact
can be determined statically in advance with no
understanding of what  the predicate actually says.
If the predicate were "X()" instead of "true", it can
be seen (by examining the generated code)
that pre() can only be entered if "X()"
is true, while pre() is guaranteed to throw an
exception if "X()" is true. It is the same
predicate in both cases that results in
the catch-22, not some combination of
multiple predicates that might indicate this
is actually the desired outcome of the grammar
writer.

My naïve (possibly wrong) approach to
doing predicates in LL(1) would start with
rewriting to BNF:

grammar Simple;

FOO : 'foo' ;

section : elements EOF ;
elements:| element elements;
element : {true}?=> pre ;
pre : FOO foos ;
foos :| FOO foos;

Predicate "hoisting" then is vaguely like calculating FIRST() sets,
but the elements of the sets are not just terminals, but predicates
and their associated FIRST() set terminals. In this example,
the predicate (call it P1) has just one terminal associated
with it (FOO). Using PFIRST() to name these sets, FIRST(elements)
contains FIRST(element), so PFIRST(elements) would also
contain {P1,FOO} (meaning, if the next token is FOO, then
P1 must be true to proceed). Likewise, PFIRST(section) would
be equal to PFIRST(elements), so the predicate has been
"hoisted" to two symbols "above", as users would expect
from the name "hoist".

It is in calculating PFIRST(foos) that the opportunity for
Catch-22 arises. Because foos is nullable, FIRST(foos)
must include FOLLOW(foos). FOLLOW(foos) must include
FOLLOW(pre), which must include FOLLOW(element)
which must include FIRST(elements), which is FIRST(element),
which is FIRST(pre). Yikes, there's predicate P1 again,
waiting to get "hoisted downward" into foos.

So my naive LL(1) algorithm has produced somewhat the
same effect as the ANTLR algorithm. But, this algorithm
is clearly wrong, then. The only way that "foos" can be
reached is after P1 has (or should have) already
completed its function in life. While it's tempting to
view predicate hoisting as analogous to the
calculation of FIRST() sets, that didn't work so hot here.

The previous observation suggests that "hoisting"
should only be extended to nonterminals that can
actually derive a string that starts with the nonterminal
containing the predicate being hoisted.