[antlr-interest] Can one identify the type of parser needed for a given BNF grammar

The Researcher researcher0x00 at gmail.com
Mon Jul 11 12:55:05 PDT 2011


>
> Loring,


Thanks for the feedback.

While I now can understand your answer, I still don't have enough
experience  to implement any approach correctly.

As a help to others who are working on this in private or interested in
creating a C++  parser I will gloss what Loring said because last week it
would have mostly been Greek to me. If I say something wrong, please
correct. I am not afraid to make mistakes in public so that others may learn
from them.

The big thing that one has to learn about parsing C++ is how to handle the
ambiguities. That is the dragon to be slayed here.

1. And then you have to figure out how to prune the GLR-generated "forests".

GLR http://en.wikipedia.org/wiki/GLR_parser

The reason GLR is considered for C++ is that do to the ambiguities of C++
most parsers in a typical fashion have to couple the feedback from the
parser to the lexer and reference a symbol table to correctly parse C++.
This is contrary to what is taught in school and most of the early examples
of parsing you will come across. This allows them to produce an AST without
ambiguities after the parser. This is pushing the work to the front. GLR is
one of the most powerful parsing strategies and can parse C++ without
coupling the parser to the lexer, but does so by generating multiple
subtrees for each ambiguity, for which you must later prune out the
ambiguities from the forests. This is pushing the work to the back.

2. C++ is nasty; it can be parsed with ANTLR (as shown by NEXT and David
Wigg's
adaptions of that grammar).

The ANTLR C++ grammar does so by creating a symbol table and using
predicates during the parse. In my initial conversions of the ANTLR C
grammar to C# I was able to convert it, but did not understand why it did
certain things at first. After researching it was clear that a lot of the
ambiguities of C++ arise when an name is encountered and then it's use ( a
type, an identifier, etc)  has to be know in order to take the appropriate
branch. Thus the creation of the symbol tables and other tables and the use
of predicates. This pushes the work to the front.

3. I believe that the right strategy with ANTLR is actually to use
multi-pass recognition to sort out the ambiguities.  That has not been done
yet.
The problem is that C++ cannot be adequately described with a context-free
grammar; you have to do some context-sensitive processing to resolve the
syntax
that is semantically ambiguous.

It could/should be possible to parse C++ with ANTLR without a symbol table
and associated predicates, and then in the AST analysis verify that the
input C++ syntax is valid C++ semantics.
Remember that the BNF for a grammar does not guarantee valid code/semantics
only correct syntax. So the AST is analyzed to find invalid semantics. I am
not sure if this will require coupling the parser to the lexer or the AST to
the parser, but if I go this route I will soon know.

I know this went off topic but was worth it.

Any answer to the original question is still appreciated.

Thanks

Eric


More information about the antlr-interest mailing list