[antlr-interest] Parsing large files: A trip report

Nathaniel Waisbrot waisbrot at highfleet.com
Tue Apr 10 14:36:07 PDT 2012


Hello, ANTLR list.  I just finished a mini-project in which I used ANTLR to convert a 20-gigabyte MySQL database dump into a set of files for ingestion into PostgreSQL, and I thought some of you might find my experience interesting.  Also, I hit a few problems along the way, and maybe some of you can offer a guess as to what I was doing wrong.

For background, I'd found two previous threads on the subject of large files:

http://www.antlr.org/pipermail/antlr-interest/2009-March/033715.html
- Vlad wants to parse a 100MB file.  People suggest chunking the file outside of ANTLR.

http://www.antlr.org/pipermail/antlr-interest/2010-April/038129.html
- Amitesh Kumar wants to syntax-check a large file.  People suggest fixing his grammar, chunking the file outside of ANTLR, and using UnbufferedTokenStream.



I wanted ANTLR to do the parsing because SQL allows multi-line quoted strings, so without some kind of parsing you can't be sure that the ';' you're looking at marks the end of a statement.  I tried passing the dump file to ANTLR, but discovered that ANTLRFileStream reads the entire file into memory.
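As a toy illustration of the problem (not code from my converter, just a made-up two-statement example), a naive split on ';' falls apart as soon as a quoted value contains a semicolon or a newline:

// Splitting on ';' treats the semicolon inside the quoted string as a
// statement terminator, so we get more "statements" than actually exist.
public class NaiveSplitDemo {
    public static void main(String[] args) {
        String dump =
            "INSERT INTO notes VALUES (1, 'first line;\nsecond line');\n" +
            "INSERT INTO notes VALUES (2, 'plain');\n";
        String[] pieces = dump.split(";");
        for (String p : pieces) {
            System.out.println("---\n" + p);
        }
    }
}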

I took a stab at rolling my own stream class, ANTLRUnbufferedFileStream, posted here ( http://pastebin.com/gyVsquQK ).  I use Java's RandomAccessFile to handle mark/rewind.  Something must be wrong with my code, though, because it behaved nondeterministically: on one run I'd get an unexpected-token error around line 20,000, on the next run the same error around line 600,000.  None of the errors popped up until it had been running for at least six minutes, so I gave up debugging it pretty quickly.
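For anyone who doesn't want to click through, the core idea was roughly the sketch below.  This is a simplified illustration rather than the pastebin code: a real ANTLR 3 CharStream also has to implement LA(), index(), size(), line/column tracking, and so on, and keeping that bookkeeping consistent across rewinds is presumably where my bug lives.  The class and method names here are just illustrative.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

// Keep only the file handle in memory and implement mark/rewind by
// remembering byte offsets and seeking back with RandomAccessFile.
public class SeekableCharSource {
    private final RandomAccessFile file;
    private final Deque<Long> marks = new ArrayDeque<Long>();

    public SeekableCharSource(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
    }

    // Next character, or -1 at end of file (assumes a single-byte encoding).
    public int consume() throws IOException {
        return file.read();
    }

    // Remember the current position so we can come back to it.
    public int mark() throws IOException {
        marks.push(file.getFilePointer());
        return marks.size();
    }

    // Return to the most recent mark.
    public void rewind() throws IOException {
        file.seek(marks.pop());
    }

    public void close() throws IOException {
        file.close();
    }
}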

After abandoning that, I determined that since my dump was machine-generated, I could safely assume that a line beginning with "INSERT INTO" starts a statement and is never part of a string.  That let me chop the file into about 23,000 pieces, averaging roughly a million characters per line, and feed each one to ANTLR separately.  It took 1.5 hours to read the file and write out the conversion.
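The splitting itself was just line-oriented Java along these lines (simplified; handleChunk() stands in for the plumbing that hands each piece to the generated lexer/parser and writes out the converted statements -- the names are illustrative, not my actual code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DumpChunker {
    public static void main(String[] args) throws IOException {
        StringBuilder chunk = new StringBuilder();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // A line starting with "INSERT INTO" begins a new statement,
                // so flush whatever we have accumulated so far.
                if (line.startsWith("INSERT INTO") && chunk.length() > 0) {
                    handleChunk(chunk.toString());
                    chunk.setLength(0);
                }
                chunk.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        if (chunk.length() > 0) {
            handleChunk(chunk.toString());
        }
    }

    private static void handleChunk(String sql) {
        // e.g. wrap sql in a string stream, run the lexer/parser over it,
        // and write the converted statement out for PostgreSQL.
    }
}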

In retrospect, I /think/ that ANTLR was the right choice, since I'll want to go back and fill in lots of holes.  (The group producing the MySQL dump is going to add a column with the 'geometry' datatype at a later date, and I'll need to figure out how to translate that into PostgreSQL.)  The grammar is fairly readable and is doing nearly all of the work.  I'm disappointed, though, that I wasn't able to stream the complete file through ANTLR in one go.  (And the way I'm doing it isn't proof against SQL injection!)  While I was dealing with the memory problems, I found myself wishing for a 'cut' operator like Prolog's, since I'm confident that most of the parsing could be done without any backtracking.

Suggestions or questions are welcome.
