[antlr-interest] Antlr3c / Tokenization of big files

jquesada jquesada at us.es
Wed Sep 26 10:17:29 PDT 2012


Hi everybody:

I've been using Antlr3c for the last few months to design a language
specification and a global platform for natural language engineering.
Everything has been really nice (I come from the lex/yacc world ... and its
derivatives).

But I'm stuck at the following point:

If I use the language model with small examples, everything works fine, but
with very large dictionaries (at the moment I'm trying one containing a
million words), the system crashes.

The segmentation fault occurs exactly in the newPoolToken function of the
antlr3c library, although I suspect the real problem is that memory is
exhausted and that what I'm seeing is in fact memory corruption.
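
To make that hypothesis concrete, this is the back-of-the-envelope
arithmetic I did. The token count is a placeholder I invented; the real
count for my dictionary will differ, and this ignores pool bookkeeping and
the text of each token:

    #include <stdio.h>
    #include <antlr3.h>

    int main(void)
    {
        /* Hypothetical token count; substitute a real measurement. */
        size_t approx_tokens = 10u * 1000u * 1000u;
        size_t per_token     = sizeof(ANTLR3_COMMON_TOKEN);

        /* If the whole input is tokenized up front, roughly this much
         * memory is needed just for the token structs themselves. */
        printf("~%zu MB for %zu tokens of %zu bytes each\n",
               (approx_tokens * per_token) / (1024u * 1024u),
               approx_tokens, per_token);
        return 0;
    }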

I've read that ANTLR tokenizes the whole input before applying any
syntactic rules, which is consistent with the behaviour I'm seeing.
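
For reference, this is the driver pattern I am using (the standard one for
the ANTLR 3.4 C API, I believe). TLexer/TParser and the start rule name
"document" are placeholders for my real generated names:

    #include <antlr3.h>
    #include "TLexer.h"
    #include "TParser.h"

    int main(int argc, char *argv[])
    {
        pANTLR3_INPUT_STREAM input =
            antlr3FileStreamNew((pANTLR3_UINT8)argv[1], ANTLR3_ENC_8BIT);
        pTLexer lexer = TLexerNew(input);

        /* As I understand it, this token stream pulls every token out
         * of the lexer before the parser consumes the first one, so
         * this is where the memory would go. */
        pANTLR3_COMMON_TOKEN_STREAM tokens =
            antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
                                             TOKENSOURCE(lexer));
        pTParser parser = TParserNew(tokens);

        parser->document(parser);   /* placeholder start rule */

        parser->free(parser);
        tokens->free(tokens);
        lexer->free(lexer);
        input->close(input);
        return 0;
    }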

Is it possible to deactivate or modify this complete pre-tokenization?
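
If there is no such switch, would chunking the input by hand be a sane
workaround? Something like the sketch below, which assumes chunk boundaries
can be cut between dictionary entries; read_next_chunk is a hypothetical
helper I would have to write myself:

    /* Sketch only: parse the file as a sequence of independent chunks,
     * with a fresh stream/lexer/parser per chunk so that token memory
     * is released between chunks. */
    char *chunk;
    ANTLR3_UINT32 len;
    while (read_next_chunk(&chunk, &len))   /* hypothetical helper */
    {
        pANTLR3_INPUT_STREAM input =
            antlr3StringStreamNew((pANTLR3_UINT8)chunk, ANTLR3_ENC_8BIT,
                                  len, (pANTLR3_UINT8)"chunk");
        pTLexer lexer = TLexerNew(input);
        pANTLR3_COMMON_TOKEN_STREAM tokens =
            antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
                                             TOKENSOURCE(lexer));
        pTParser parser = TParserNew(tokens);

        parser->document(parser);           /* placeholder start rule */

        parser->free(parser);
        tokens->free(tokens);
        lexer->free(lexer);
        input->close(input);
    }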

Regards,





