[antlr-interest] Huge tables in C lexers

Jim Idle jimi at temporal-wave.com
Sat Nov 10 20:12:11 PST 2012


You misunderstand, Mike. The situation is pretty much the exact opposite
of what you suggest. Java has to do what it does because there is no way
in Java to declare character arrays with pre-initialized data. C/C++ is
streets ahead in this regard.

The tables in Java are expanded at runtime into the SAME tables that the
C target generates directly. When an executable built from the C target
starts, the tables are immediately available in the data segment (dseg)
and there is no overhead. The Java target must load the compressed
versions of the strings at startup, then decompress them into memory and
use the decompressed tables. So the Java version uses extra memory to
hold the compressed strings, then a bunch of CPU cycles to decompress
them, then garbage-collects the compressed strings, and only then can it
finally use the tables. It is not the C target that is inefficient in
this regard.
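
To make the difference concrete, here is a rough sketch in C. The names,
values, and encoding below are illustrative assumptions, not actual ANTLR
output; the real generated tables and the Java target's exact packing
differ in detail.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* C target: tables are pre-initialized static data, so they sit in
     * the executable's data segment and are usable the moment the
     * process is mapped in. (Illustrative values only.) */
    static const int32_t dfa_transition[] = { -1, 5, 5, 5, 2, -1 };

    /* Java cannot declare a large pre-initialized array this way, so
     * the Java target ships a compressed string and rebuilds the array
     * at startup, roughly like this run-length decode (a sketch, not
     * the actual ANTLR algorithm): */
    static int32_t *unpack(const unsigned char *rle, size_t npairs)
    {
        size_t total = 0, di = 0;
        for (size_t i = 0; i < npairs; i++)
            total += rle[2 * i];                  /* run length */
        int32_t *out = malloc(total * sizeof *out);
        if (out == NULL)
            return NULL;
        for (size_t i = 0; i < npairs; i++)
            for (size_t j = 0; j < rle[2 * i]; j++)
                out[di++] = rle[2 * i + 1];       /* repeated value */
        return out;
    }

The static C array costs nothing at startup; the unpack step is the extra
memory and CPU the Java target must pay before its lexer can run.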

While the tables are not a big deal these days, you should still limit
them by making your lexer leaner. You are probably telling the lexer to
pick out certain classes of characters (from a spec?). It is better to
relax the lexer rules so that they recognize a roughly correct pattern,
then check the character set afterwards and issue a semantic error. If
you get a lexer error, all you can do is say "invalid character", and
that pretty much ends the recognition. Move your errors as far into the
tool chain as you can, because you will give your users much better
context in the error messages.
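
As a concrete illustration of that advice: in the grammar, instead of
enumerating every permitted Unicode subrange (which is what inflates the
DFA tables), let the identifier rule accept one broad range, such as the
whole BMP above ASCII, and validate the token text in a later pass. A
minimal sketch in C follows; the allowed ranges and function names are
hypothetical placeholders, not any real spec or ANTLR API.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical spec check: is this BMP code point permitted in an
     * identifier? The ranges are placeholders for whatever the spec
     * actually allows. */
    static bool is_valid_id_char(uint32_t cp)
    {
        return (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z')
            || cp == '_'
            || (cp >= 0x00C0 && cp <= 0x02AF); /* e.g. Latin extras */
    }

    /* Walk an already-lexed identifier (decoded to code points) and
     * report a semantic error with real context, instead of letting
     * the lexer die with a bare "invalid character". */
    static bool check_identifier(const uint32_t *cps, size_t n,
                                 const char *file, int line)
    {
        for (size_t i = 0; i < n; i++) {
            if (!is_valid_id_char(cps[i])) {
                fprintf(stderr,
                        "%s:%d: character U+%04X is not allowed in an "
                        "identifier\n", file, line, (unsigned) cps[i]);
                return false;
            }
        }
        return true;
    }

Because the lexer rule now matches one broad range instead of dozens of
small ones, the generated transition tables collapse to a fraction of
their size, and the error the user sees names the offending character and
its location rather than just aborting the token.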

Jim

-----Original Message-----
From: Mike Lischke [mailto:mike at lischke-online.de]
Sent: Saturday, November 10, 2012 8:23 PM
To: Jim Idle
Cc: Borneq; antlr-interest at antlr.org
Subject: Re: [antlr-interest] Huge tables in C lexers


Hey Jim,

> Please see ANTLR.markmail.org. The issue is with your grammar. The C
> target lays out static tables that the compiled code can then use
> directly as in-memory structures loaded from the executable, so you
> see the full data set. Java creates compressed strings which must
> first be loaded at startup and then decompressed to generate the same
> tables as C.
>
> Review your grammar by looking at which of the tables are so big and
> correlating these with your rules. You should be able to see the
> issue.
>
> Jim
> (At the 200th time of answering this one ;)

.. which shows how important this issue is. My lexer is 32MB in size, just
because of these tables. That stems from the fact that I have to allow
almost the entire Unicode BMP in my identifiers; without that, the lexer
shrinks to 7MB. Maybe it would be worth implementing a similar compression
feature in the C target too? Do you know whether this same problem will
also exist in ANTLR v4?

Mike
--
www.soft-gems.net

