[antlr-interest] Data value is field length

Fri Jan 23 09:20:50 PST 2009

joel at mentics.com wrote:
> So, if I have a binary field, the first byte of which indicates how
> long the field is, is there a way to do this in ANTLR?
>
> The Lexer would have to get the first byte, look at its value, read
> that many more bytes, and that would be the end of that field.
>
> Any ideas on how this might best be done in ANTLR?
You might need a custom input stream that has some base knowledge of the 
stream's structure. However, lexer actions have access to the input 
stream via 'input', so assuming that you can encode the start of such a 
token in lexer rules, all you need to do is write custom code that calls 
input.consume() for as many 'characters' as you need. You have not said 
what the target language is, so I have had to presume Java. Make sure 
that you set the encoding on your input stream so that you read 8 bit 
binary characters and do not re-interpret the stream as UTF8 or something!
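
To illustrate the encoding point, here is a minimal plain-Java sketch (no 
ANTLR classes involved; the class and method names are made up for this 
example). It decodes the same bytes once as ISO-8859-1, where every byte 
maps to exactly one char, and once as UTF-8, where bytes such as 0xFF are 
invalid and get replaced. I believe the ANTLR 3 Java input streams accept 
an encoding name in their constructors, but check that against the 
runtime version you are using.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, only here to show the effect of the charset choice.
final class Latin1Demo {

    // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value,
    // so the decoded string has one char per input byte.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // UTF-8 treats bytes >= 0x80 as parts of multi-byte sequences; bytes
    // that do not form a valid sequence become the replacement character.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

With the first method a lexer literal such as '\u00FF' can match the raw 
0xFF byte; with the second the byte value is lost before the lexer ever 
sees it.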

To be honest, if your binary data is in some fixed format, then ANTLR 
might even be overkill: if you can infer the structure in a simple read 
through the data then you don't need a parser anyway ;-). If it has a 
fairly complex structure, then writing a custom input stream that 
rewrites the input into an easier form could be an approach. For 
example, suppose you just have a couple of easily identifiable binary 
points in a bigger structure: your input stream looks for, say, 0xFF, 
and it knows that in any context whatsoever this means the next two 
bytes are a 16 bit length, followed by that many bytes of binary data. 
You could just have it rewrite this bit as: 
BINARY{nnnn, 0xXX, 0xXX, 0xXX ..} or some other form that the lexer can 
deal with no problem. But again, if the marker is always 0xFF then you 
can use a lexer rule and input.consume(), something like this:

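BINARY : '\u00FF'
     {
          // Assuming 8 bit input, the next 'character' is the length,
          // but you can find the length one way or another.
          // consume() returns nothing, so read the value with LA(1) first.
          int bytes = input.LA(1);
          input.consume();
          for (int i = 0; i < bytes; i++) { input.consume(); }
     }
;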
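
For the rewriting approach described earlier, a rough sketch of such a 
pre-pass in plain Java might look like the following. Everything here is 
an assumption for illustration: the class name, the big-endian byte 
order of the 16 bit length, printing the length in decimal, and the 
exact spelling of the BINARY{...} output. Adjust it to whatever your 
data really looks like.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical pre-pass: turns 0xFF + 16 bit length + payload into text
// of the form BINARY{n, 0xXX, ...} and passes all other bytes through.
final class BinaryRewriter {
    static final int MARKER = 0xFF;

    static String rewrite(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        StringBuilder out = new StringBuilder();
        int b;
        while ((b = data.read()) != -1) {
            if (b != MARKER) {
                out.append((char) b); // ordinary byte, kept as an 8 bit char
                continue;
            }
            // Assumed layout: two length bytes, most significant first.
            int length = data.readUnsignedShort();
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException on truncated input
            out.append("BINARY{").append(length);
            for (byte p : payload) {
                out.append(String.format(", 0x%02X", p & 0xFF));
            }
            out.append('}');
        }
        return out.toString();
    }

    // Convenience overload for in-memory data.
    static String rewrite(byte[] raw) {
        try {
            return rewrite(new ByteArrayInputStream(raw));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting string can then be handed to the lexer as ordinary text. 
Because the payload is read by length, a 0xFF inside the binary data is 
not mistaken for a new marker.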

Hopefully that gives you enough info to determine what your best 
approach is for the dataset :-)

Jim