[antlr-interest] Parsing large files: A trip report

Ivan Brezina ibre5041 at ibrezina.net
Thu Apr 12 02:38:37 PDT 2012


Hello,
just a quick comment.
If you use the C target you can use mmap. The whole file is mapped into
your process's address space, and all the reads (and pre-fetches) are
handled by the OS kernel.

In the Java target you can use NIO's MappedByteBuffer, which is the
equivalent of mmap. Note that a single Java "mmap" is limited to 2GB,
because ByteBuffer is indexed by int.

Ivan

Quoting Nathaniel Waisbrot <waisbrot at highfleet.com>:

> Hello, ANTLR list.  I just finished a mini project where I used   
> ANTLR to convert a 20-gigabyte MySQL database dump into a set of   
> files for ingest into PostgreSQL, and I thought some of you might   
> find my experience interesting.  Also, I had a few problems along   
> the way, and maybe some of you can offer a guess as to what I was   
> doing wrong.
>
> For background, I'd found two previous threads on the subject of large files:
>
> http://www.antlr.org/pipermail/antlr-interest/2009-March/033715.html
> - Vlad wants to parse a 100MB file.  People suggest chunking the   
> file outside of ANTLR.
>
> http://www.antlr.org/pipermail/antlr-interest/2010-April/038129.html
> - Amitesh Kumar wants to syntax-check a large file.  People suggest   
> fixing his grammar, chunking the file outside of ANTLR, and using   
> UnbufferedTokenStream.
>
>
>
> I wanted ANTLR to do the parsing because SQL allows for multi-line
> quoted strings, so without some kind of parse you can't be sure that
> the ';' you're looking at signifies the end of a statement. I tried
> passing the dump file to ANTLR, but discovered that ANTLRFileStream
> tries to read the entire file into memory.
>
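For reference, the usual ANTLR 3 setup that hits this wall looks like
the sketch below; ANTLRFileStream buffers the whole file in its
constructor, before the lexer sees a single character. (MySQLLexer,
MySQLParser and the start rule are hypothetical names for whatever the
grammar generates.)

    import org.antlr.runtime.*;

    public class DumpMain {
        public static void main(String[] args) throws Exception {
            // Reads the entire file into one char[] up front --
            // exactly what fails on a 20GB dump.
            CharStream input = new ANTLRFileStream(args[0]);
            MySQLLexer lexer = new MySQLLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            MySQLParser parser = new MySQLParser(tokens);
            parser.statements(); // hypothetical start rule
        }
    }
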
> I took a stab at rolling my own Stream class,
> ANTLRUnbufferedFileStream, posted here
> (http://pastebin.com/gyVsquQK). I use Java's RandomAccessFile to
> handle mark/rewind. Something must be wrong with my code, though,
> because when I ran it, I'd get nondeterministic behavior. One run
> I'd have an unexpected token around line 20,000; the next run, I'd
> have the same error around line 600,000. None of the errors popped
> up until it had been running for at least 6 minutes, so I gave up
> debugging it pretty quickly.
>
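The core seek mechanics of that idea look roughly like the sketch
below. This is not the full ANTLR CharStream contract: a real
implementation must also save and restore line/column state with each
mark, and ANTLR can hold several nested marks at once -- both easy
places for nondeterminism to creep in.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Hypothetical core of an unbuffered char stream over a file.
    class SeekableChars {
        private final RandomAccessFile file;
        private long markPos = -1;

        SeekableChars(String path) throws IOException {
            file = new RandomAccessFile(path, "r");
        }

        int next() throws IOException {
            return file.read();              // one byte per char: ASCII dumps only
        }

        void mark() throws IOException {
            markPos = file.getFilePointer(); // remember current offset
        }

        void rewind() throws IOException {
            file.seek(markPos);              // jump back to the mark
        }
    }
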
> After abandoning that, I determined that since my dump was
> machine-generated, I could safely assume that a line beginning with
> "INSERT INTO" was the start of a statement and never part of a
> string. That allowed me to chop the file into about 23,000 pieces,
> averaging a million characters per line, and feed each one to ANTLR
> separately. It took 1.5 hours to read in the file and write out the
> conversion.
>
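That splitting step can be as simple as the sketch below (the chunk
file names are made up, and a real run would batch many statements per
chunk rather than one):

    import java.io.*;

    public class DumpSplitter {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            PrintWriter out = null;
            int chunk = 0;
            String line;
            while ((line = in.readLine()) != null) {
                // Safe split point: the machine-generated dump never puts
                // "INSERT INTO" at the start of a line inside a string.
                if (out == null || line.startsWith("INSERT INTO")) {
                    if (out != null) out.close();
                    out = new PrintWriter(new FileWriter("chunk-" + chunk++ + ".sql"));
                }
                out.println(line);
            }
            if (out != null) out.close();
            in.close();
        }
    }
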
> In retrospect, I /think/ that ANTLR was the right choice, since I'll
> want to go back and patch in lots of holes. (The group producing
> the MySQL dump is going to add a column with the 'geometry' datatype
> at a later date, and I'll need to figure out how to translate that
> into PostgreSQL.) The grammar is fairly readable, and is doing
> nearly all of the work. I'm disappointed, though, that I wasn't
> able to stream the complete file through ANTLR in one go. (And the
> way I'm doing it isn't proof against SQL injection!) While I was
> dealing with the memory problems, I was wishing that I had a 'cut'
> operator like in Prolog, since I'm confident that most of the
> parsing could be done without any back-tracking.
>
> Suggestions or questions are welcome.


