[antlr-interest] Bug fixes in C target for ANTLR 3.1b2

Jim Idle jimi at temporal-wave.com
Thu Jul 17 12:35:46 PDT 2008


C Target Update for ANTLR 3.1 Beta 2

For ANTLR 3.1 beta 2, the following bugs have been address in the C
target, as well as other minor bugs without JIRA entries. If you feel
that you have reported an error that is not shown as fixed here, then
please let me know as soon as you can and I will try to include fixes
for all outstanding bugs, at the very least before we release 3.1. 

Once again, I will make a rather futile request for someone to write
unit tests for the C runtime. It is rarely a good idea for the
programmer to write their own unit tests, though I will try to put some
regression tests in place, if there are no volunteers.

Please see general announcement for Beta 2 for downloading instructions.
Hint: Try the download link from the main ANTLR page - look for the
snapshot/interim releases if you want bug fixes AFTER the official Beta
2.

Bug fixes - reported by users


ANTLR-277, ANTLR-290 - Template incorrectly referred to <name> rather
than <rule> causing incorrect code gen.
ANTLR-289 - Casting of rewrite exception message was missing, causing C
++ compilers to complain.
ANTLR-278 - Pop of a stacked input stream that was sitting at EOF would
cause premature end of input stream.
ANTLR-280 - Potentially, a token without any associated text could cause
the default error display function to de-reference NULL.
ANTLR-281 - Code generation template missing from all targets added.
ANTLR-282 - Ensure that NULL guards are removed from attribute/label
access - causes  parsing failures in all targets
ANTLR-283 - Invalid options for lexer only grammar cause incorrect code
generation. (Note that an ANTLR fix is also required for this to work,
but that this fix might be to reject the grammar because it has invalid
options for lexer grammar).
ANTLR-284 - Change error recovery to reflect Java runtime improvements
ANTLR-308 - #ifdef __cplusplus } #endif #endif generated in wrong order.


Bug fixes - not reported by users

ANTLR-285 - Ensure that all examples are up to date with ANTLR 3.1b2
[No JIRA bug] - Debugger did not send all signals correctly to ANTLR
Works


Still outstanding

ANTLR-275 - Memory free of rewrite streams can cause double release
Tree walkers encountering syntax errors in the tree inadvertently
release the wrong memory pointer
Other minor bugs as per C Target Bug List
ANTLR-276 - Finish ANTLRWorks debugger testing - volunteers?

Improvements/Enhancements

Configure uses more stringent checks for headers (compile vs
pre-process)

ANTLR debugger code only included in debug version by default unless
configure is told to include it anyway. This removes the dependency on
socket libraries for released versions of the runtime library. Use
option --enable-antlrdebug/--disable-antlrdebug on UNIX, use #define
ANTLR3_NODEBUGGER on Windows (though this is less relevant).

All initialization of follow sets (bit maps) is removed from parser
startup, which reduces generated code size and startup times
substantially for large grammars. Bitsets are loaded dynamically from
static bit list declarations during error recovery.

AST building performance improvements. AST rewrite nodes were being
created with a description string to say where they came from. These
were all constructing pANTLR3_STRINGs, which is a huge overhead given
that a user would virtually never use these strings. These nodes now
just store the pointer to the static string in the generated code. Need
to consider an option to not generate things like these description
strings unless -debug mode is turned on. This will potentially reduce
code size for all targets.


Improvements Scheduled for Next Beta/Release Version

A number of things are too important to leave until 3.2.

ANTLR-279 - Building and rewriting of ASTs via abstract interfaces got
lost along the way somewhere int he previous release. For final release
of 3.1, this will all be corrected such that you can once again build
trees of any node type and so on. At least one user is waiting for this
- apologies that this did not make it in to this beta, it is important.

In relation the tree building via rewrite stream improvement already
completed, there are often a lot of createTokenText() calls. For
instance all the imaginary nodes and tree root nodes can be subject to
this. At the moment such nodes create pANTLR3_STRING which is quite
expensive considering that in general, only ANTLRWorks wants to know
about the text values. Hence tokens will be changed to use unions and
either store just a pointer to static strings OR a pointer to
pANTLR3_STRING (but this will only be present it setText is called on
the token by the user actions in lexers). For large trees, such as
complete translation units for programming languages, this should
provide significant performance improvements for AST building parsers
and tree parsers.

Memory allocation policy. This will change to malloc rather than calloc,
with specific initializations when necessary. Prior to recent
performance improvements the implied memset() for calloc was a minor
overhead. However, this has now moved to be about 11% of all parse time.
Experiments show that a lot of the memory is then initialized with
function pointers anyway and so it will be a worthwhile improvement to
abandon calloc in favor of malloc. Not sure why I did that in the first
place - there must have been some reason - I better check my
comments ;-)

AST node memory allocations. Though almost all elements of the C runtime
are allocated via factories, certain vectors used for rewriting AST node
children are not allocated from a vector factory. Once memset overhead
is removed, this is the single biggest performance overhead for building
ASTs, although by this point there is almost nothing left to optimize
and it really just appears to be an overhead when considered as part of
relative execution times. Worth improving upon though.

Much better C++ integration via supplied C++ class facades for the more
the most frequently used parts of a C parser and probably generated
class facades for the lexer/parser/etc themselves.

Code clean up, comment review and API docs. These will all be verified
and made complete, as well as applying consistent formatting across all
source codes. 


ANTLR 3.2 (C runtime)


I feel, but have not verified, that the overhead in being able to
override every method of every construct in the C runtime is not worth
the performance penalty for many applications, hence I am going to
experiment with a C-lite runtime, that is much closer to the metal and
abandons any concept of reflecting the Java runtime. If you are happy
with the AST structures and walkers that it would produce, and do not
need custom tokens, nodes and LA/LT etc, then this will be both smaller
and faster. For instance input streams will run directly against memory
pointers, trees will be simpler memory structures and so on. It won't
manage memory for you very automatically, but you will be as close to
the metal as can be. I will produce this runtime if experiments show any
significant performance benefits.


C++ Target news

As soon as this release is complete, I will put some major effort into a
true C++ target. Though I may abandon this idea after playing, I am
going to look into producing an ANTLR runtime target generator for
object oriented languages. The idea being that by specifying some
general policies for code generation you could produce say a C++ runtime
directly from the Java runtime source. Such things are fraught with
difficulty so this would not be a general converter of Java to X, and
will probably not be template driven in general (as much as I like that
idea) - consider it an experiment that may not come off if I cannot make
it generate a runtime that is close to one I would write myself. Either
way, I will produce a fully functional C++ target.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080717/de96c182/attachment.html 


More information about the antlr-interest mailing list