[antlr-interest] Streaming Support

Horst Dehmer horst.dehmer at gmail.com
Mon May 17 07:01:27 PDT 2010


Hi there,

I'd like to parse 'large' streams (200 MB and more) with only small
chunks of data (tokens/characters) in memory at a time. The
parser/lexer should block until more characters are available from
the underlying input stream. I have a few simple 'callbacks' embedded
in the grammar which call into the business logic to process
recognized data. But with the standard setup, the callbacks are only
invoked after the complete input stream has been read.

// uncompressed replication transaction.
transaction
   : { if (callback != null) callback.startTransaction(); }
     x01 (update_type)+
     { if (callback != null) callback.endTransaction(); }
   ;

update_type
   : entityId=entity_id '{' (values+=basic_update)+ '}'
     { if (callback != null) callback.updateType($entityId.text, $values); }
   ;

basic_update returns [List<String> values]
@init {
   $values = new ArrayList<String>();
}
   : '{' s=value { $values.add($s.text); }
     ( '|' ( s=value { $values.add($s.text); } )? )* '}'
   ;

There are a few reasons why I'd like to do it this way:

1. the data is received in rather small chunks (< 4k or so) from NIO
   sockets (see the sketch right after this list)
2. I don't want to buffer the data on the file system (file I/O)
3. the memory footprint should stay as small as possible
4. many streams may be processed/parsed at the same time
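
To make (1) concrete: the NIO receive loop decodes each chunk to
characters and hands it to a Reader that the parser thread blocks on.
Roughly like the following sketch (ChunkedReader and its
put()/endOfInput() methods are just my own names; charset decoding is
assumed to happen in the receive loop):

import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Bridges the NIO thread and the parser thread: the receive loop
// put()s decoded chunks, the parser thread blocks in read().
public class ChunkedReader extends Reader {
    private static final char[] EOF_CHUNK = new char[0];
    private final BlockingQueue<char[]> chunks =
        new LinkedBlockingQueue<char[]>();
    private char[] current;
    private int pos;
    private boolean done;

    // Called from the NIO loop with each decoded chunk (< 4k or so).
    public void put(char[] chunk) throws InterruptedException {
        if (chunk.length > 0) chunks.put(chunk);
    }

    // Called once when the peer closes the connection.
    public void endOfInput() throws InterruptedException {
        chunks.put(EOF_CHUNK);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (done) return -1;
        if (current == null || pos == current.length) {
            try {
                current = chunks.take();   // parser thread blocks here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for data");
            }
            pos = 0;
            if (current == EOF_CHUNK) { done = true; return -1; }
        }
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, cbuf, off, n);
        pos += n;
        return n;
    }

    @Override
    public void close() { done = true; }
}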

I'm using ANTLR 3.1.3 (Java/Scala).

From what I can see, CommonTokenStream.fillBuffer() is pretty greedy
and loads all tokens at once. Right now I'm using ANTLRInputStream as
the CharStream, which also reads the whole InputStream up front.
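
To show the kind of thing I have in mind: a TokenStream (my own
class, not something from the runtime) that pulls one token at a time
from the lexer instead of filling a buffer. One token of real
lookahead is enough for me; mark()/rewind()/seek() are left
unsupported, so backtracking and syntactic predicates won't work:

import java.util.ArrayList;
import java.util.List;

import org.antlr.runtime.Token;
import org.antlr.runtime.TokenSource;
import org.antlr.runtime.TokenStream;

// Pull-based TokenStream sketch: the parser blocks inside LT()/LA()
// until the lexer can deliver the next token. Consumed tokens are
// dropped, so the lookahead window stays tiny.
public class PullTokenStream implements TokenStream {
    private final TokenSource source;
    private final List<Token> lookahead = new ArrayList<Token>();
    private Token lastConsumed;   // for LT(-1), used by rule return scopes
    private int consumedCount;

    public PullTokenStream(TokenSource source) { this.source = source; }

    // Pull on-channel tokens from the lexer until k are available (or EOF).
    private void sync(int k) {
        while (lookahead.size() < k) {
            Token t = source.nextToken();
            if (t.getType() == Token.EOF) { lookahead.add(t); return; }
            if (t.getChannel() == Token.DEFAULT_CHANNEL) lookahead.add(t);
        }
    }

    public Token LT(int k) {
        if (k == 0) return null;
        if (k == -1) return lastConsumed;
        if (k < -1) throw new UnsupportedOperationException("LT(" + k + ")");
        sync(k);
        return lookahead.get(Math.min(k, lookahead.size()) - 1); // clamp at EOF
    }

    public int LA(int i) { return LT(i).getType(); }

    public void consume() {
        sync(1);
        if (lookahead.get(0).getType() == Token.EOF) return; // never consume EOF
        lastConsumed = lookahead.remove(0);  // token is gone for good
        consumedCount++;
    }

    public int index() { return consumedCount; }
    public TokenSource getTokenSource() { return source; }
    public String getSourceName() { return source.getSourceName(); }

    // Random access and rewinding are deliberately unsupported here.
    public Token get(int i) { throw new UnsupportedOperationException(); }
    public int mark() { throw new UnsupportedOperationException(); }
    public void rewind(int marker) { throw new UnsupportedOperationException(); }
    public void rewind() { throw new UnsupportedOperationException(); }
    public void release(int marker) { /* nothing to release */ }
    public void seek(int index) { throw new UnsupportedOperationException(); }
    public int size() { throw new UnsupportedOperationException(); }
    public String toString(int start, int stop) { return "<streaming>"; }
    public String toString(Token start, Token stop) { return "<streaming>"; }
}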

Is there a (simple) way to accomplish this? What would be the right
approach: a custom TokenStream or rather a custom CharStream? BTW: a
lookahead of 1 is fine for me.
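
On the CharStream side, I imagine something along these lines (again
untested; for brevity it never discards consumed characters, since
CommonToken fetches its text lazily via substring(), so a real
version would still need to trim the buffer to actually get the small
footprint):

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.antlr.runtime.CharStream;

// CharStream sketch that pulls characters from a Reader on demand, so
// the lexer blocks until more input arrives. mark()/rewind() mirror
// ANTLRStringStream, since the lexer's DFA uses them while predicting
// which token rule to match.
public class LazyCharStream implements CharStream {

    private static class Snapshot {
        final int p, line, charPos;
        Snapshot(int p, int line, int charPos) {
            this.p = p; this.line = line; this.charPos = charPos;
        }
    }

    private final Reader input;
    private final StringBuilder data = new StringBuilder();
    private final List<Snapshot> marks = new ArrayList<Snapshot>();
    private boolean hitEOF;
    private int p;                       // index of the current character
    private int line = 1;
    private int charPositionInLine;

    public LazyCharStream(Reader input) { this.input = input; }

    // Block on the Reader until i characters of lookahead exist (or EOF).
    private void sync(int i) {
        while (!hitEOF && data.length() < p + i) {
            try {
                int c = input.read();
                if (c < 0) hitEOF = true; else data.append((char) c);
            } catch (IOException e) { throw new RuntimeException(e); }
        }
    }

    public int LA(int i) {
        if (i == 0) return 0;
        if (i < 0) i++;                            // LA(-1) = previous char
        if (p + i - 1 < 0) return CharStream.EOF;
        sync(i);
        if (p + i - 1 >= data.length()) return CharStream.EOF;
        return data.charAt(p + i - 1);
    }

    public int LT(int i) { return LA(i); }

    public void consume() {
        if (LA(1) == CharStream.EOF) return;
        if (data.charAt(p) == '\n') { line++; charPositionInLine = 0; }
        else charPositionInLine++;
        p++;
    }

    public int mark() {
        marks.add(new Snapshot(p, line, charPositionInLine));
        return marks.size();                       // 1-based marker
    }

    public void rewind(int marker) {
        Snapshot s = marks.get(marker - 1);
        p = s.p; line = s.line; charPositionInLine = s.charPos;
        release(marker);
    }

    public void rewind() { rewind(marks.size()); }

    public void release(int marker) {
        while (marks.size() >= marker) marks.remove(marks.size() - 1);
    }

    public void seek(int index) {   // same contract as ANTLRStringStream
        if (index <= p) { p = index; return; }
        while (p < index && LA(1) != CharStream.EOF) consume();
    }

    public int index() { return p; }
    public int size() { return data.length(); }    // read so far, not total
    public String substring(int start, int stop) {
        return data.substring(start, stop + 1);
    }
    public int getLine() { return line; }
    public void setLine(int line) { this.line = line; }
    public int getCharPositionInLine() { return charPositionInLine; }
    public void setCharPositionInLine(int pos) { charPositionInLine = pos; }
    public String getSourceName() { return "nio-stream"; }
}

Wired together it would look something like this (MyLexer/MyParser
stand for the generated classes, setCallback() for whatever injects
the callback object):

ChunkedReader reader = new ChunkedReader();      // fed by the NIO loop
MyLexer lexer = new MyLexer(new LazyCharStream(reader));
MyParser parser = new MyParser(new PullTokenStream(lexer));
parser.setCallback(callback);   // hypothetical; e.g. a @members field
parser.transaction();           // callbacks now fire as chunks arrive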

Thanks for your help.

Cheers,
	Horst



