[antlr-interest] antlr v4 wish list

Sam Harwell sharwell at pixelminegames.com
Wed Mar 30 04:02:55 PDT 2011


Sorry about the misunderstanding there.

I've done some extensive work on lexer performance, but it was focused on
source files of at most a couple of dozen megabytes each. ANTLR for Java
is certainly not equipped to handle large-scale operations, even at the
scale I was testing, due to some fundamental language limitations. Using
carefully written grammars and my experimental "SlimLexer" implementation
for the CSharp3 target, I've achieved rates of approximately 10MB of
source per second, which *significantly* outperformed even the C target.
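[Editor's sketch: throughput figures like the 10MB/s above are typically
obtained by timing one full pass over an in-memory buffer. The scanner
below is a hypothetical stand-in, not SlimLexer itself, and the sample
input is arbitrary; it only shows how such a rate is measured.]

```java
import java.nio.charset.StandardCharsets;

public class LexerThroughput {
    // Stand-in "lexer": scans the buffer once, counting word-like
    // tokens. A real lexer does more work per character, but the
    // measurement harness is the same.
    static int scan(char[] buf) {
        int tokens = 0;
        boolean inWord = false;
        for (char c : buf) {
            boolean w = Character.isLetterOrDigit(c);
            if (w && !inWord) tokens++;
            inWord = w;
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] input = "int x = 42; // sample source\n"
                .repeat(100_000).toCharArray();
        long bytes = new String(input)
                .getBytes(StandardCharsets.UTF_8).length;

        long start = System.nanoTime();
        int tokens = scan(input);
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("%d tokens, %.1f MB/s%n",
                tokens, bytes / (1024.0 * 1024.0) / seconds);
    }
}
```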

The lexer implementation planned for ANTLR v4 should approach (and hopefully
exceed) the performance of my SlimLexer, but I don't think there's any
intention to consider gigabytes of source code.
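[Editor's sketch: the underlying difference in the "slurp" complaint
below is between loading an entire file into one buffer, as ANTLR 3's
ANTLRFileStream does, and feeding the lexer a fixed-size window. A rough
Java sketch of the windowed approach, with arbitrary buffer size and a
StringReader standing in for a multi-gigabyte FileReader:]

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class WindowedCharStream {
    // Reads characters through a small fixed buffer instead of
    // slurping the whole input into memory. EOF is signalled as -1.
    private final Reader in;
    private final char[] buf;
    private int len = 0, pos = 0;

    public WindowedCharStream(Reader in, int bufSize) {
        this.in = in;
        this.buf = new char[bufSize];
    }

    public int next() throws IOException {
        if (pos >= len) {             // refill the window
            len = in.read(buf);
            pos = 0;
            if (len == -1) return -1; // end of input
        }
        return buf[pos++];
    }

    public static void main(String[] args) throws IOException {
        // A dozens-of-GB netlist would be opened with a FileReader
        // here; a StringReader keeps the example self-contained.
        WindowedCharStream s =
                new WindowedCharStream(new StringReader("module m;"), 4);
        int c, count = 0;
        while ((c = s.next()) != -1) count++;
        System.out.println(count);    // 9 characters, read 4 at a time
    }
}
```

A real ANTLR char stream also has to support mark/rewind for lookahead,
which complicates windowing considerably; this sketch ignores that.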

On a side note, I'm assuming the dozens of gigabytes weren't handwritten
but are the result of an intermediate tool in the compiler tool chain. I
would treat this as a substantial, unacceptable design flaw in a system
designed for business use. Any practical system for data on this scale
uses data formats and layouts which can be efficiently manipulated for
the desired information. This is like replacing a long hallway in an
office building with a maze, complaining that it takes too long to get
to the bathroom, and wondering if go-karts might help. The problem
exists well before parsing is ever considered.

Sam

-----Original Message-----
From: Martin d'Anjou [mailto:point14 at magma.ca] 
Sent: Tuesday, March 29, 2011 11:55 PM
To: Sam Harwell
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] antlr v4 wish list

Hi Sam,

With regards to your answer to item 4) Gigantic files, I meant the
problem of lexing and parsing gigantic source files, such as Verilog
netlists, which can be dozens of gigabytes of source code and take hours
to lex and parse due to their size. The problem is described at
http://v2kparse.blogspot.com/2008/06/first-pass-uploaded-to-sourceforce.html
.
To quote his blog:

"I was compelled to use ANTLR 2.7.7 since the token stream mechanism does
not try to slurp in the whole source file, an issue which I encountered with
the more recent ANTLR 3.0.

While Verilog source files are not generally large, netlist files can be
humungous, and one can quickly run out of memory by "slurping in the whole
tamale."

Anyway, I've communicated the large file slurp issue to the author of
ANTLR and he'll be working out a solution in future releases.

(If you think large Verilog netlists are problematic to slurp, think
about a SPEF file --- where I first encountered the problem using ANTLR
3.x. Anyway, going back to 2.7.7 works fine, even for large SPEF
files.)"

As I said, this might have been fixed already; I just don't know.

Regards,
Martin



On 11-03-29 11:29 PM, Sam Harwell wrote:
> 4. With proper integration into the build system, generated files 
> aren't checked into source control or distributed. The ANTLR project 
> itself generates V2 and V3 grammars, and my .NET projects generate V3 
> grammars (using my C# port of the Tool) at build time, so the 
> generated files never take up space in source control.



