[antlr-interest] Data value is field length

Fri Jan 23 09:44:34 PST 2009

Excellent information. That's exactly what I was hoping for: that I could
consume the input like that.

One more question: what if this binary field occurs in the middle of
a UTF-16 stream?

Actually, I can provide my own input stream to ANTLR, right? So I
could provide an input stream that can serve either chars or bytes.
The lexer action could then cast 'input' to my custom input stream,
read the bytes as needed, and then continue in whatever encoding the
stream was supposed to be in after this field.
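
A minimal sketch of that idea, assuming the text is UTF-16BE and the
binary field starts with a single length byte. The class name
DualModeInput and all of its methods are hypothetical; this is not wired
into ANTLR, and a real version would also have to implement ANTLR's
CharStream interface so the lexer could use it.

```java
import java.util.Arrays;

/** Hypothetical dual-mode reader: UTF-16BE chars or raw bytes from one buffer. */
final class DualModeInput {
    private final byte[] data;
    private int pos = 0; // byte offset of the next unread byte

    DualModeInput(byte[] data) {
        this.data = data;
    }

    /** Returns the next UTF-16BE code unit without consuming it, or -1 at end of input. */
    int peekChar() {
        if (pos + 1 >= data.length) {
            return -1;
        }
        return ((data[pos] & 0xFF) << 8) | (data[pos + 1] & 0xFF);
    }

    /** Skips one UTF-16 code unit (two bytes). */
    void consumeChar() {
        pos += 2;
    }

    /** Reads one unsigned length byte, then returns that many raw bytes. */
    byte[] readLengthPrefixedField() {
        if (pos >= data.length) {
            throw new IllegalStateException("no length byte available");
        }
        int len = data[pos++] & 0xFF;
        if (pos + len > data.length) {
            throw new IllegalStateException("truncated binary field");
        }
        byte[] field = Arrays.copyOfRange(data, pos, pos + len);
        pos += len; // char reads resume right after the field
        return field;
    }
}
```

Because the position is tracked in bytes, char reads simply resume at
whatever offset the binary field ends on, even if the field has an odd
length.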

And to answer some of your questions: yes, Java, and yes, the location
of the field is pretty easy to identify, so that should not be much of
an issue. We're considering ANTLR for tackling various data formats with
a wide range of structural complexity. Since some may have these types
of fields, I'm making sure we can do everything in ANTLR in a
straightforward and general manner.

Thanks!

On Fri, Jan 23, 2009 at 9:20 AM, Jim Idle <jimi at temporal-wave.com> wrote:
> joel at mentics.com wrote:
>> So, if I have a binary field, the first byte of which indicates how
>> long the field is, is there a way to do this in ANTLR?
>>
>> The Lexer would have to get the first byte, look at its value, read
>> that many more bytes, and that would be the end of that field.
>>
>> Any ideas on how this might best be done in ANTLR?
> You might need a custom input stream that has some base knowledge of the
> stream. However, in lexer actions, you have access to the input stream
> via 'input' and assuming that you can encode the start of such a token
> in lexer rules, all you need to do is write custom code to
> input.consume() as many 'characters' as you need. However, you have not
> said what the target language is, so I have had to presume Java. Make sure
> that you set the encoding on your input stream such that you read 8-bit
> binary characters and do not re-interpret the stream as UTF-8 or something!
>
> To be honest, if your binary data is in some fixed format, then ANTLR
> might even be overkill, but if it has a fairly complex structure, then
> writing a custom input stream that rewrites the input stream into an
> easier form could be an approach. But if you can infer the structure in
> a simple read through the data then you don't need a parser anyway ;-).
> However, suppose you just have a couple of easily identifiable binary
> points in a bigger structure: your input stream looks for, say, 0xFF, and
> it knows that in any context whatsoever this means the next two
> bytes are a 16-bit length, and then that many bytes are binary. You could
> just have it rewrite this bit as: BINARY{nnnn, 0xXX, 0xXX, 0xXX ..} or
> some other form that the lexer can deal with no problem. But again, if
> it is always 0xFF then you can use a lexer rule and input.consume(),
> something like this:
>
> BINARY : '\u00FF'
>     {
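>          // Assuming 8-bit input; you can find the length one way or another.
>          // consume() returns nothing in the Java runtime, so read the
>          // length byte with LA(1) before consuming it.
>          int bytes = input.LA(1);
>          input.consume();
>          for (int i = 0; i < bytes; i++) { input.consume(); }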
>     }
> ;
>
> Hopefully that gives you enough info to determine what your best
> approach is for the dataset :-)
>
> Jim
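
For reference, the rewriting approach Jim describes could be sketched
roughly as follows. This is only an illustration under stated
assumptions: the class BinaryFieldRewriter and its rewrite method are
hypothetical, the input is treated as 8-bit characters, 0xFF always
marks a binary field, and the 16-bit length is assumed to be big-endian
(the original message does not specify the byte order).

```java
/** Hypothetical pre-pass that turns 0xFF-marked binary fields into lexer-friendly text. */
final class BinaryFieldRewriter {

    /** Rewrites each 0xFF + 16-bit length + payload as BINARY{n, 0xXX, ...}. */
    static String rewrite(byte[] in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            if (b != 0xFF) {
                out.append((char) b); // ordinary 8-bit character, copied as-is
                i++;
                continue;
            }
            if (i + 2 >= in.length) {
                throw new IllegalArgumentException("truncated length after 0xFF");
            }
            // Assumption: the two length bytes are big-endian.
            int len = ((in[i + 1] & 0xFF) << 8) | (in[i + 2] & 0xFF);
            i += 3;
            if (i + len > in.length) {
                throw new IllegalArgumentException("truncated binary payload");
            }
            out.append("BINARY{").append(len);
            for (int k = 0; k < len; k++) {
                out.append(String.format(", 0x%02X", in[i + k] & 0xFF));
            }
            out.append('}');
            i += len;
        }
        return out.toString();
    }
}
```

The rewritten text can then be handed to an ordinary lexer, at the cost
of an extra pass over the data and a larger input.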