[antlr-interest] Antlr3c / Tokenization of big files

Jim Idle jimi at temporal-wave.com
Wed Sep 26 18:13:04 PDT 2012


You will need to partition the input: find a logical place to split it,
parse it in pieces, but keep your symbol tables and so on intact.
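
Something along these lines works with the 3.4 C runtime. The grammar
name (Dict), the entry rule (entries), and the symbol table type are
placeholders for whatever you actually have:

#include <antlr3.h>
#include "DictLexer.h"
#include "DictParser.h"

typedef struct SymbolTable SymbolTable;  /* your own shared state */

void parseInChunks(char **chunkFiles, int nChunks, SymbolTable *symtab)
{
    for (int i = 0; i < nChunks; i++)
    {
        /* Per-chunk runtime objects, created and destroyed each pass */
        pANTLR3_INPUT_STREAM input =
            antlr3FileStreamNew((pANTLR3_UINT8)chunkFiles[i], ANTLR3_ENC_8BIT);
        pDictLexer lexer = DictLexerNew(input);
        pANTLR3_COMMON_TOKEN_STREAM tokens =
            antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lexer));
        pDictParser parser = DictParserNew(tokens);

        /* Your grammar actions record what they find into symtab,
           which outlives every chunk. */
        parser->entries(parser);

        /* Release this chunk's token pool before lexing the next one */
        parser->free(parser);
        tokens->free(tokens);
        lexer->free(lexer);
        input->close(input);
    }
}

The point is that only the symbol table survives the loop; the token
stream, and hence the token pool that is blowing up for you, never
holds more than one chunk's worth of tokens.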

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of jquesada
Sent: Thursday, September 27, 2012 1:17 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Antlr3c / Tokenization of big files

Hi everybody:

For the last few months I've been using Antlr3c to design a language
specification and a global platform for natural language engineering.
Everything is really nice (I come from the lex/yacc world ... and its
derivatives).

But I'm stuck at the following point:

If I use the language model with small examples, everything goes fine,
but with very large dictionaries (I'm currently trying one containing
one million words) the system crashes.

The segmentation fault occurs exactly in the newPoolToken function of
the antlr3c library, although I suspect the real problem is that memory
is being exhausted and this is in fact a memory-corruption issue.

I've read that ANTLR tries to tokenize the whole input before applying
the syntactic rules, which is consistent with the behaviour I'm seeing.
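
For what it's worth, a rough estimate suggests the buffered token
stream alone is a real problem at this scale (the two-tokens-per-word
figure is just a guess for a dictionary format; ANTLR3_COMMON_TOKEN is
the token struct the C runtime pools):

#include <stdio.h>
#include <antlr3.h>

int main(void)
{
    size_t words  = 1000000;           /* dictionary size                */
    size_t tokens = words * 2;         /* assume word + separator each   */
    size_t bytes  = tokens * sizeof(ANTLR3_COMMON_TOKEN);

    /* Token structs only; pool bookkeeping, token text, and the
       input buffer itself all come on top of this. */
    printf("~%zu MB for token structs alone\n", bytes / (1024 * 1024));
    return 0;
}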

Is it possible to deactivate or modify this complete pre-tokenization?

Regards,




--
View this message in context:
http://antlr.1301665.n2.nabble.com/Antlr3c-Tokenization-of-big-files-tp7578903.html
Sent from the ANTLR mailing list archive at Nabble.com.


