[antlr-interest] Can one identify the type of parser needed for a given BNF grammar

Loring Craymer lgcraymer at yahoo.com
Mon Jul 11 20:59:22 PDT 2011


Sigh.  'class' is a keyword in C++, so your first statement is wrong.  What you propose is exactly what was done in the NEXT/Wigg parser; the cost is in parser synpreds (sempreds in Wigg's grammars; ANTLR 2 did not hoist synpreds, so Wigg had to work from the PCCTS generated code in getting a workable ANTLR 2 grammar) and tremendous backtracking overhead.  I rather suspect that the result is considerably slower than the multi-pass approach would be; the NEXT/Wigg grammars do lots of backtracking as a matter of course.  From studying the Wigg grammar, I estimate that there is at least a factor of two overhead from backtracking; tree walkers are faster than parsers (excluding actions) as the result of having a more canonical form (less lookahead; k=1 is normal).

By separating semantic analysis from syntactic analysis, you can support more comprehensive error reporting--multiple errors reported without having to implement parser error recovery--with line/column numbers (stored in the tokens).  Being in a rush to report the first error encountered is silly:  most of us can afford to wait an extra few milliseconds (or perhaps longer; more comprehensive error reporting costs in terms of display time).

Symbol tables are neither magic nor a "one size fits all" technology--there is no one true SymbolTable class.  The information stored in symbol table entries varies across languages, and there may be multiple symbol tables for a given language (having a type table separate from the identifier symbol table is common).

Folding preprocessing into the lexer can be done; I think Jim Idle did that for C, but I could be mistaken.  Whether the added complexity is worth the improved processing time over having a separate preprocessor, I have no idea.  Certainly, that is best done as an optimization step; using a separate preprocessor for early development simplifies the problem of finding bugs in the lexer.  It is never a good idea to unnecessarily add error sources.

So:  you are wrong on almost all counts.  You need more practical experience with language translation, and a bit of "book learning" would not be out of order.

--Loring




>________________________________
>From: Douglas Godfrey <douglasgodfrey at gmail.com>
>To: Loring Craymer <lgcraymer at yahoo.com>
>Cc: The Researcher <researcher0x00 at gmail.com>; antlr-interest at antlr.org
>Sent: Monday, July 11, 2011 4:33 PM
>Subject: Re: [antlr-interest] Can one identify the type of parser needed for a given BNF grammar
>
>
>If the SymbolTable is not built by the Parser then you cannot generate the correct AST
>for a class definition vs a function call. Also you cannot generate prompt errors close to
>the point where the error occurs in the parse. 90+% of all semantic ambiguities can be 
>resolved by a SymbolTable in the Parser. Most C++ programs will not have any statements 
>that cannot be resolved inline in the parser and would need to be resolved in a later AST
>TreeParser phase. The 90% case should not pay the compile time cost of a separate 
>TreeParser just to resolve ambiguities that can be handled by the Parser.
>
>No feedback is required for the Lexer. The Lexer can resolve all pre-processor statements
>before the Parser is invoked. The Lexer uses it's own primitive SymbolTable to handle 
>text substitutions and macros. Again, 90+% of all C++ programs do not use any complex
>macros where a second lexer pass would be needed. For simple includes and text 
>substitutions the Lexer can handle them inline. In 10% of the cases The Lexer needs 2 
>passes where the first pass processes all includes, macros and substitutions and the 
>second pass generates the token stream for the Parser.  
>
>
>On Mon, Jul 11, 2011 at 5:16 PM, Loring Craymer <lgcraymer at yahoo.com> wrote:
>
>On 3.) :  The parser just recognizes syntax and ignores semantic ambiguities;
>>then in a first tree walker pass, the symbol tables are constructed; a second
>>tree walker pass uses the symbol table information to resolve ambiguities (which
>>of several interpretations of valid syntax is correct) and rewrites the tree
>>into an unambiguous form.  Then you have finished recognizing C++ and can
>>continue with further processing.  Never, never, never would I suggest adding a
>>feedback mechanism that couples parser to lexer.
>>
>>
>>--Loring
>>
>>
>>----- Original Message ----
>>> From: The Researcher <researcher0x00 at gmail.com>
>>> To: antlr-interest at antlr.org
>>
>>> Sent: Mon, July 11, 2011 12:55:05 PM
>>> Subject: Re: [antlr-interest] Can one identify the type of parser needed for a
>>>given BNF grammar
>>>
>>> >
>>
>>> > Loring,
>>>
>>>
>>> Thanks for the feedback.
>>>
>>> While I now can  understand your answer, I still don't have enough
>>> experience  to  implement any approach correctly.
>>>
>>> As a help to others who are working on  this in private or interested in
>>> creating a C++  parser I will gloss  what Loring said because last week it
>>> would have mostly been Greek to me. If  I say something wrong, please
>>> correct. I am not afraid to make mistakes in  public so that others may learn
>>> from them.
>>>
>>> The big thing that one has  to learn about parsing C++ is how to handle the
>>> ambiguities. That is the  dragon to be slayed here.
>>>
>>> 1. And then you have to figure out how to prune  the GLR-generated "forests".
>>>
>>> GLR http://en.wikipedia.org/wiki/GLR_parser
>>>
>>> The reason GLR is considered  for C++ is that do to the ambiguities of C++
>>> most parsers in a typical  fashion have to couple the feedback from the
>>> parser to the lexer and  reference a symbol table to correctly parse C++.
>>> This is contrary to what is  taught in school and most of the early examples
>>> of parsing you will come  across. This allows them to produce an AST without
>>> ambiguities after the  parser. This is pushing the work to the front. GLR is
>>> one of the most  powerful parsing strategies and can parse C++ without
>>> coupling the parser to  the lexer, but does so by generating multiple
>>> subtrees for each ambiguity,  for which you must later prune out the
>>> ambiguities from the forests. This is  pushing the work to the back.
>>>
>>> 2. C++ is nasty; it can be parsed with  ANTLR (as shown by NEXT and David
>>> Wigg's
>>> adaptions of that  grammar).
>>>
>>> The ANTLR C++ grammar does so by creating a symbol table and  using
>>> predicates during the parse. In my initial conversions of the ANTLR  C
>>> grammar to C# I was able to convert it, but did not understand why it  did
>>> certain things at first. After researching it was clear that a lot of  the
>>> ambiguities of C++ arise when an name is encountered and then it's use (  a
>>> type, an identifier, etc)  has to be know in order to take the  appropriate
>>> branch. Thus the creation of the symbol tables and other tables  and the use
>>> of predicates. This pushes the work to the front.
>>>
>>> 3. I  believe that the right strategy with ANTLR is actually to use
>>> multi-pass  recognition to sort out the ambiguities.  That has not been  done
>>> yet.
>>> The problem is that C++ cannot be adequately described with a  context-free
>>> grammar; you have to do some context-sensitive processing to  resolve the
>>> syntax
>>> that is semantically ambiguous.
>>>
>>> It could/should  be possible to parse C++ with ANTLR without a symbol table
>>> and associated  predicates, and then in the AST analysis verify that the
>>> input C++ syntax is  valid C++ semantics.
>>> Remember that the BNF for a grammar does not guarantee  valid code/semantics
>>> only correct syntax. So the AST is analyzed to find  invalid semantics. I am
>>> not sure if this will require coupling the parser to  the lexer or the AST to
>>> the parser, but if I go this route I will soon  know.
>>>
>>> I know this went off topic but was worth it.
>>>
>>> Any answer to  the original question is still  appreciated.
>>>
>>> Thanks
>>>
>>> Eric
>>>
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe:
>>>http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>>
>>
>>List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
>
>
>


More information about the antlr-interest mailing list