[antlr-interest] OPEN, EOF ambiguity

Sun Apr 11 11:33:13 PDT 2010

This is an interesting example.

EBNF is handy when prototyping a grammar, but IME causes more
problems than it solves in the long run, usually in the form of obscuring
where the real problems lie. Seems like it is just quite difficult to give
intelligible (to non parser experts) error messages in the face of EBNF
constructs.

If I read this particular example correctly, the '+' EBNF operator is
being applied to a nullable non-terminal. I'm thinking that's the highest
level at which this error could have been identified. Without EBNF,
I think the grammar writer would be led to write rules
that presented with a left recursion problem, again
due to the nullability of cltext, but would have been closer
to the real issue (assuming the left recursion error
messages were kind enough to point out which leading
right-hand side non-terminals were nullable).

OTOH, maybe EBNF is really a good thing if enough
effort is invested in high-level error reporting. I can't see
that it would be impossible for a generator to be smart
enough to flag the '+' and say:
    Can't decide whether to keep repeating 'cltext', which
    can expand to the empty string because 'text' can
    expand to the empty string.
When you have doodads like preconditions, this gets
harder, presumably.

> I guess I'm not clear on why ANTLR is seeing
> OPEN and EOF as an ambiguity.

You tell the computer: "recognize any number of 'cltext'".
But 'cltext' can match the empty string (via 'text'). So,
no matter what token the parser runs into, just how many
empty 'cltext' items do you want it to recognize first?
Zero? One?, 10,000?

The error message is confusing because it is too
low-level and mentions OPEN and EOF. OPEN and
EOF just happen to be the tokens that would be legal
at that point. But the real issue is that you've asked
to recognize an indefinite number of empty strings
that could *precede* that OPEN or EOF.

Computers being literal-minded, this is an ambiguity.
And, in fact, people writing parsers often *do* want
empty strings recognized at various points to get
associated actions triggered before or after a
particular item. But asking to recognize an
indefinite number of empty strings is inherently
ambiguous (nothing really to do with the particular
algorithm you use to parse the grammar). As
the punchline to a bad joke goes:
"How do they know when they're done?"

>  However, the language spec that I must conform to
> specifically states that a text can have zero or more
> phrases (which is the crux of the problem).

Presumably that spec writer simply ignored the fact
that that leads to an infinite number of zero-phrase
'text' items in every input. Since you are the implementer,
however, you must explicitly account for all the
ambiguities the spec writer did not bother to
disambiguate. If specifications were completely
specific, then they could be automatically turned
into code with no need for a programmer. :-)