[antlr-interest] parsing huge files

Wed Oct 3 10:42:48 PDT 2007

Hi,

cimbroken wrote:

> I'm trying to use ANTLR for a task that I've done till now with perl
> regexp: parse huge log files.
> My primary goal is to program parsers in a more declarative fashion, of
> course, instead of writing by hand tons of regexp, while<> cycles and
> if-elsif statements for every different log type.
> 
> I think my problem is ANTLR's CommonTokenStream, because in method
> fillBuffer() tries to buffer *all* tokens from the Lexer while my needs
> are to parse one log record at a time, discard old tokens, read a bunch of
> new tokens (log files are repetitive and made up of independent records)
> and go on like this.
> Instead, when the magnitude of input files is Gigabytes or more, such
> stream fills up memory even before the Parser starts doing its work!
> 
> I'm not a parsing neither ANTLR expert, so I'm asking some advice about
> changing this behaviour (or at least someone who says: "don't use antlr
> for this!"). Is it possible to do the trick at grammar level, or must I
> subclass CommonTokenStream or ANTLRInputStream, or... ?

You can write your own versions of the character and token streams that
consume characters and tokens on demand, in contrast to the current
implementation, which buffers everything up front. I could offer you an
implementation of such a token stream in Python. This has been raised
before, so you may find more pointers in the list archives or on the wiki.
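
To illustrate what "on demand" means here, the following is a minimal
sketch in plain Python, not ANTLR code. The Token type, the TOKEN_RE
pattern and the lazy_tokens() function are all made up for this example and
assume a toy log format; the only point is that a generator hands out one
token at a time while holding just the current line in memory.

```python
import io
import re
from typing import Iterator, NamedTuple, TextIO


class Token(NamedTuple):
    type: str
    text: str


# Hypothetical token definitions for a toy log format (an assumption,
# not taken from any real grammar).
TOKEN_RE = re.compile(
    r"(?P<INT>\d+)|(?P<WORD>[A-Za-z_]+)|(?P<NL>\n)|(?P<WS>[ \t]+)|(?P<OTHER>.)"
)


def lazy_tokens(stream: TextIO) -> Iterator[Token]:
    """Yield tokens one at a time, reading the input line by line."""
    for line in stream:  # only the current line is held in memory
        for match in TOKEN_RE.finditer(line):
            if match.lastgroup != "WS":  # whitespace is skipped
                yield Token(match.lastgroup, match.group())
    yield Token("EOF", "")


if __name__ == "__main__":
    tokens = lazy_tokens(io.StringIO("GET 200\nPUT 404\n"))
    print(next(tokens))  # nothing beyond the first line has been read yet
```

A token stream built on top of such a generator only ever asks for the next
token when the parser needs it.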

The second point is about discarding old tokens. It should be possible to
call a method from the parser at the end of a 'record' that instructs the
token stream to flush all consumed tokens - that should be pretty safe and
straightforward. You could even flush tokens from the token stream's
buffer on every consume(), but there may be circumstances in which negative
lookahead (looking back at already consumed tokens) is needed - I can't
tell off the top of my head...
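
As a rough sketch of the flushing idea, assuming a token source that is
just a Python iterator: the class below is hypothetical and is not the
ANTLR runtime's CommonTokenStream. The names OnDemandTokenStream, flush()
and buffered() are invented for illustration; LT() and consume() only mimic
the usual token stream interface. Note how looking back stops working
across a flush, which is exactly the risk mentioned above.

```python
from typing import Iterator, List, Optional


class OnDemandTokenStream:
    """Buffers only the tokens the parser has asked for so far."""

    def __init__(self, source: Iterator[str]) -> None:
        self._source = source
        self._buffer: List[str] = []
        self._pos = 0  # index of the next token to be consumed

    def _fill(self, upto: int) -> None:
        # Pull tokens until index `upto` is buffered or the input ends.
        while len(self._buffer) <= upto:
            tok = next(self._source, None)
            if tok is None:
                break
            self._buffer.append(tok)

    def LT(self, k: int) -> Optional[str]:
        """Token k ahead (k >= 1) or behind (k <= -1); None if unavailable."""
        if k == 0:
            return None
        i = self._pos + k - 1 if k > 0 else self._pos + k
        if i < 0:
            return None  # already flushed, or before the start of input
        self._fill(i)
        return self._buffer[i] if i < len(self._buffer) else None

    def consume(self) -> None:
        self._fill(self._pos)
        if self._pos < len(self._buffer):
            self._pos += 1

    def flush(self) -> None:
        """Drop all consumed tokens; meant to be called at a record boundary."""
        del self._buffer[: self._pos]
        self._pos = 0

    def buffered(self) -> int:
        return len(self._buffer)


if __name__ == "__main__":
    stream = OnDemandTokenStream(iter(["a", "b", "\n", "c", "d", "\n"]))
    for _ in range(3):  # pretend the parser matched the first record
        stream.consume()
    stream.flush()
    print(stream.buffered(), stream.LT(1), stream.LT(-1))
```

In a grammar, the call to flush() would sit in an action at the end of the
rule that matches one record, so memory use is bounded by the largest
record plus the lookahead.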

Well, that's the generic approach and if you can come up with a solution,
feel free to post it to the wiki ;)

It could be much easier if you can split the input into records up front
(e.g. if record == line): feed one record into the parser at a time and
reset the parser afterwards.
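
A minimal sketch of that simpler route, assuming one record per line. The
functions parse_records() and split_fields() are hypothetical;
split_fields() is only a stand-in for whatever runs a fresh (or reset)
lexer and parser over a single record, and the "timestamp level message"
layout is an assumption made for the example.

```python
import io
from typing import Callable, Dict, Iterator, TextIO

Record = Dict[str, str]


def split_fields(record: str) -> Record:
    """Stand-in for the real per-record parser (assumed record layout)."""
    ts, level, msg = record.split(" ", 2)  # raises ValueError if malformed
    return {"ts": ts, "level": level, "msg": msg}


def parse_records(
    stream: TextIO, parse_one: Callable[[str], Record]
) -> Iterator[Record]:
    """Feed one record at a time to parse_one; nothing else is kept around."""
    for line in stream:  # the file is read lazily, line by line
        line = line.rstrip("\n")
        if line:  # blank lines are ignored
            yield parse_one(line)


if __name__ == "__main__":
    log = io.StringIO("t1 INFO started\n\nt2 ERROR disk full\n")
    for rec in parse_records(log, split_fields):
        print(rec["level"], rec["msg"])
```

Because each record gets its own short token stream, the stock buffering
behaviour stops being a problem: the buffer never holds more than one
record's worth of tokens.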

HTH