[antlr-interest] Streaming Support
Horst Dehmer
horst.dehmer at gmail.com
Mon May 17 07:01:27 PDT 2010
Hi there,
I'd like to parse 'large' streams (200 MB and more) while keeping only
small chunks of data (tokens/characters) in memory at a time.
The goal is for the parser/lexer to block until more characters are
available from the underlying input stream. I have a few simple
'callbacks' embedded in the grammar which call into the business logic
to process recognized data. But with the standard setup, the callbacks
are only invoked after the complete input stream has been read.
// uncompressed replication transaction.
transaction
    : { if (callback != null) callback.startTransaction(); }
      x01 (update_type)+
      { if (callback != null) callback.endTransaction(); }
    ;

update_type
    : entityId = entity_id '{' (values = basic_update)+ '}'
      { if (callback != null) callback.updateType(entityId, values); }
    ;

basic_update returns [List<String> values]
@init {
    values = new ArrayList<String>();
}
    : '{' s = value { values.add(s); }
          ('|' (s = value { values.add(s); })? )* '}'
    ;
There are a few reasons why I'd like to do it this way:
1. the data is received in rather small chunks (< 4 KB or so) from NIO
sockets
2. I don't want to buffer the data on the file system (file I/O)
3. I want as small a memory footprint as possible
4. many streams may be processed/parsed at the same time
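To make the blocking behavior I'm after concrete, here is a minimal, ANTLR-free sketch (the class and the chunk data are made up for illustration; the real producer would be NIO socket reads). A `java.io.PipedInputStream`/`PipedOutputStream` pair gives exactly this semantics: the consumer's `read()` blocks until the producer thread delivers the next chunk.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class BlockingFeed {

    // Reads everything from the pipe; each read() blocks until the
    // producer thread has written the next chunk.
    static String readAll() throws Exception {
        PipedOutputStream producer = new PipedOutputStream();
        // 4 KB pipe buffer, matching the chunk size mentioned above
        PipedInputStream consumerSide = new PipedInputStream(producer, 4096);

        // Producer thread: simulates small chunks arriving from a socket.
        Thread feeder = new Thread(() -> {
            try {
                for (String chunk : new String[] { "{a|", "b}", "{c}" }) {
                    producer.write(chunk.getBytes(StandardCharsets.US_ASCII));
                    producer.flush();
                    Thread.sleep(10); // chunks arrive with a delay
                }
                producer.close();
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        feeder.start();

        // Consumer: read() blocks until the producer delivers more bytes.
        StringBuilder seen = new StringBuilder();
        int b;
        while ((b = consumerSide.read()) != -1) {
            seen.append((char) b);
        }
        feeder.join();
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAll()); // prints {a|b}{c}
    }
}
```

A blocking `Reader`/`InputStream` like this could back the `CharStream`; the open question is keeping the token side equally lazy.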
I'm using ANTLR 3.1.3 (Java/Scala).
From what I can see, CommonTokenStream.fillBuffer() is pretty greedy
and loads all tokens at once. Right now I'm using ANTLRInputStream as
the CharStream.
Is there a (simple) way to accomplish this? What would be the right
approach: a custom TokenStream, or rather another CharStream? BTW: a
lookahead of 1 is fine for me.
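For the custom-token-stream direction, this is roughly the shape I have in mind. Nothing below is ANTLR API: `LazyTokenStream`, `peek`, `consume` and the toy lexer are hypothetical names, sketched in plain Java to show an on-demand stream with lookahead 1. The real change would be a TokenStream implementation that lexes one token per LT/consume call instead of filling the whole buffer up front.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: a token stream that pulls characters on demand
// and keeps exactly one token of lookahead, instead of buffering the
// entire input the way CommonTokenStream.fillBuffer() does.
public class LazyTokenStream {
    private final Reader input; // blocks when no data is available yet
    private String lookahead;   // the single LT(1) token, or null at EOF
    private int nextChar;       // one-character lookahead for the lexer

    public LazyTokenStream(Reader input) throws IOException {
        this.input = input;
        this.nextChar = input.read();
        this.lookahead = lexOne();
    }

    /** LT(1): peek at the next token without consuming it. */
    public String peek() { return lookahead; }

    /** Consume the current token and lex the next one on demand. */
    public String consume() throws IOException {
        String t = lookahead;
        lookahead = lexOne();
        return t;
    }

    // Toy lexer: runs of letters are one token, anything else is a
    // single-character token ('{', '}', '|', ...).
    private String lexOne() throws IOException {
        if (nextChar == -1) return null; // end of input
        char c = (char) nextChar;
        if (Character.isLetter(c)) {
            StringBuilder word = new StringBuilder();
            while (nextChar != -1 && Character.isLetter((char) nextChar)) {
                word.append((char) nextChar);
                nextChar = input.read();
            }
            return word.toString();
        }
        nextChar = input.read();
        return String.valueOf(c);
    }

    public static void main(String[] args) throws Exception {
        LazyTokenStream ts = new LazyTokenStream(new StringReader("{ab|c}"));
        while (ts.peek() != null) {
            System.out.println(ts.consume());
        }
    }
}
```

Since only lookahead 1 is needed, a single buffered token is enough; the lexer only touches the `Reader` when the parser actually asks for the next token, so a blocking stream would stall the parse exactly where I want it to.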
Thanks for your help.
Cheers,
Horst