[antlr-interest] Ambiguous/Nondeterministic Grammars (Was: Working around the LALL(k) vs. LL(k) problem ?)

Thu Mar 21 15:14:34 PST 2002

I have found the answer to my own question of whether or not a tool
exists for validating whether or not a general context-free grammar is
unambiguous, and the answer is no.  More information is at the end of
this message, but first let me address other points that have been
raised.  I apologize that this has gotten somewhat lengthy, but I've
learned quite a bit about this over the past few days, and I figure
that it's worth summarizing for others.

On Thu, Mar 21, 2002 at 08:54:29AM -0800, Terence Parr wrote:
> 
> On Thursday, March 21, 2002, at 06:24  AM, mzukowski at bco.com wrote:
> 
> > I'm curious why it is important to you to have no syntactic ambiguity.  
> > What
> > kind of language are you designing?
[...]

Nothing terribly interesting; I'm just reinventing the wheel for fun.
I'm designing a general-purpose OO language, complete with multiple
inheritance, polymorphism, genericity, the whole nine yards.  I don't
expect anybody to use it, I'm just doing it as an exercise, so to
speak.

I want the syntax to have an unambiguous context-free grammar because
that's part of the exercise.  It seems to me that such a syntax is
likely to be more easily readable by humans (as well as machine
parsers), and I'm shooting for a very clean, readable syntax.  It
should be possible to glance at any statement in the language and know
right off the bat (without context) what each part of it means
syntactically.  This implies that the syntax must be expressible
unambiguously in a context-free grammar.

So please note that where I have said "unambiguous" in previous
messages, I have meant it in the sense that the syntax can be
specified in a context-free grammar such as EBNF.  In this message, I
have tried to be explicit about what sort of ambiguity I'm talking
about.  A langauge spec can be perfectly unambiguous, but still be
impossible to specify unambiguously in EBNF.  C/C++ is a good example
of this; one cannot write an unambiguous EBNF grammar of C, but the
langauge spec is, in fact, unambiguous.

> Technically, you want *any* computer language you design to be 
> syntactically unambiguous.  Even if-then-else is unambiguous since the 
> language manual says "ELSE goes with the most recent IF".  Remember that 
> a language is just a valid set of sentences and however you say it is 
> ok.  The trouble is when we try to formalize the language specification 
> with a grammar.
[...]

If-then-else cannot be specified unambiguously in a context-free,
priority/predicate-free grammar, such as EBNF.  That doesn't mean that
the syntax is ambiguous, or that the syntax can't be specified
unambiguously in a context-sensitive grammar (as achieved in ANTLR
using predicates).  But doing so violates my design requirement above:
that each statement be parsable out of context, and without resorting
to putting priorities on the rules of the grammar.  (This is the
approach taken by most parsers in the face of ambiguity: simply say
that in case of ambiguity, rule A overrides rule B.  This is difficult
for humans to remember, and therefore leads to less readable code.)

As an example, the solution to the if-then-else problem, if an
unambiguous context-free grammar is desired, is to require explicit
block terminators in the syntax, as so:

  if condition then
     if condition then
        statement
        statement
     else
        statement
        statement
     end
  end

This is unambiguous in the context-free sense and, IMO, more readable.
Yes, it's more verbose, but I think readability is more imporatant
than brevity for my purposes.  The EBNF version of that is:

if: "if" expression "then" (statement)* ("else" (statement)*)? "end";

Note that "end" cannot be elided in any case.  (Alternatively, a
syntax could use curly braces, as in C, but some way of explicitly
terminating the block must be provided.)

Obviously, if you're writing a grammar for a predefined language spec,
you do not have the luxury of modifying the syntax just to make your
grammar/parser cleaner.  If this is the task at hand, a tool like
ANTLR is definitely what you want, as its predication mechanisms make
it possible to specify a correct grammar for any syntax.  Fortunately,
I'm developing the language spec from scratch, so I do have the luxury
of adapting the syntax spec to make the grammar/parser more elegant.

Which leads me on to the answer to my original question regarding
validation of unambiguity.  I found a tool called Accent
(http://accent.compilertools.net/), and at first I thought it was the
perfect solution to my problem.  It is able to generate a parser for
*any* EBNF grammar.  (By their own admission, these parsers may be
much slower than LL(k) or LR(k) parsers.)  This means it can handle
things that cannot be handled by LL(k) grammars, and things that can't
be handled by LR(k) grammars.  (I mean straight, non-predicated LL(k)
and LR(k) here.  Accent can't do anything that ANTLR can't.)  This
means you can write your grammar without thinking about the parsing
strategy.  Unfortunately, as discussed at
(http://accent.compilertools.net/overview.html), Accent will happily
allow you to create ambiguous grammars, and will generate parsers that
only detect the ambiguity at run-time, when presented with actual
ambiguous input.  (Accent provides an annotation mechanism that serves
the same purpose as ANTLR's predicates; to resolve the
ambiguities/nondeterminisms.)

Following the link on the Accent home page to the ACM article
(http://www.acm.org/crossroads/xrds7-5/bison.html), I found a brief
discussion of ambiguity at the bottom, which claims (with a reference
to an academic paper), that it is impossible to validate the
unambiguity of a general EBNF grammar.  (Specific grammars can be
proven unambiguous by hand, but there's no algorithm for validating
grammars in general.)  They state that you simply have to test your
grammars with lots of different inputs to see if they're ambiguous.

Personally, I'd rather specify it in a predicated LALL(k) or LALR(k)
grammar, which will tell me during parser generation if there's a
nondeterminism, without relying on testing to discover the problems.
(Same reason I prefer statically type-checked languages.)  If you can
write a deterministic LL(k) or LR(k) grammar for the syntax, then the
syntax is guaranteed to be unambiguous in the context-free sense as
well.  However, if you use predicates to get around the limitations of
these parsing strategies, then you've opened that possibility that
your syntax will ambiguous in the context-free sense.  (Even those it
is unambiguous in the general sense.)

So the end result is that I'm out of luck.  I can either shackle
myself to the limitations of unpredicated LL(k) or LR(k) grammars
(which would guarantee context-free unambiguity as well), or use
predicates, knowing that doing so may cause my syntax to be ambiguous
in a context-free grammar.  I think I'm going to choose the latter
option, and just limit the use of predicates as much as possible.

That might be a good point to include in the book.  ;-)

Thanks for all the help, folks.

Randall Nortman

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/