[antlr-interest] Binary support

Fri Sep 23 09:57:14 PDT 2011

As there are no further posts right now, I would like to take the
opportunity to draw a personal conclusion (I admit it turned out rather long ;-).

ASN.1

I took a look at ASN.1. It was a really quick look, so I might be
wrong here; ASN.1 experts are welcome to correct me.
I got the feeling that in ASN.1 syntax and encoding are strongly
coupled, i.e. ASN.1 is a human-readable notation, but you have to take the
encodings provided. This is quite fine for protocols where you normally
only care that the encoding is good (compact etc.), but
not how it works in detail, because that is handled automatically by
generated code.

That said, ASN.1 is in my eyes not feasible if you have an already
defined file format and want to generate a parser for it from an ASN.1
grammar.

ANTLR and binary formats

I still think that it would be great if ANTLR were enhanced to be
able to parse binary formats as well. In my eyes it's the right place, and it would make ANTLR even more unique.
Making ANTLR fit for binary formats would involve the following changes:
 1. Enhance capabilities of input handling
 2. Enhance ANTLR grammar
 3. Enhance code generator of ANTLR

For 1.: In effect, ANTLR already handles a binary file format. The moment ANTLR reads files in one of the four Unicode encodings (UTF-8, UTF-16 LE, UTF-16 BE, UTF-32), including byte order mark and surrogate support, it is lexing a binary format.
Because I don't know ANTLR in detail, I am guessing that the Sun/Oracle code does this work. So ANTLR does not do this explicitly, but by using the official class libraries. I think some work would have to be done here, but if the Java class libraries are not flexible enough, I'm quite sure that ICU4J will be.
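
To illustrate the point that reading Unicode input is already a small piece of binary parsing, here is a minimal sketch in Python (not ANTLR or Java code) that inspects the leading bytes for a byte order mark and then decodes the rest. The function names `sniff_encoding` and `decode` are made up for this example, and falling back to UTF-8 when no BOM is present is an assumption.

```python
import codecs

# (BOM bytes, encoding name). The UTF-32 entries come first because the
# UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) guessed from a leading byte order mark."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # Assumption for this sketch: input without a BOM is treated as UTF-8.
    return default, 0


def decode(data: bytes) -> str:
    """Strip the BOM (if any) and decode the remaining bytes to text."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)
```

The first bytes decide how all following bytes are interpreted, which is the same situation a binary format puts a parser in.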

2. and 3. are quite clear: The ANTLR grammar currently has no support for binary formats, so an extension of some sort would be needed, and of course the ANTLR code generator must support it as well.

The last question to discuss is: Is it possible to describe binary formats in a grammar?

I say: Yes, for most of them this will work. For those where it does not work fully, a grammar would at least make life easier (you would end up doing the rest using actions etc.).

In a former post Ron Burk said:

"Binary file formats also often just aren't directly representable by context free grammars. For example, a header may contain offsets of different objects, and the sizes of those objects may have to be inferred from the difference in offsets. Grammars, despite looking seductively similar because of having recursively nested constructs in common, aren't a great match for this domain.

One could imagine useful domain-specific languages for binary file formats, but they might not look quite like grammar tools, and a single language might not be sufficient for all tasks."

I agree and disagree. Whether or not they are context free, they can be parsed. Binary formats have the benefit that they were designed to be _machine readable_ and not, like programming languages, _human readable_. In general this makes them easier to parse.
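
As a small illustration of the offset case from the quote, here is a sketch in Python for an invented layout: a little-endian 32-bit object count, then that many 32-bit absolute offsets, then the objects themselves. The layout, the function name `read_objects`, and the rule that the last object runs to the end of the file are all assumptions made for the example, not taken from any real format.

```python
import struct


def read_objects(data: bytes):
    """Split `data` into objects using an offset table in the header.

    Hypothetical layout: u32 count, then `count` u32 absolute offsets
    (all little-endian), followed by the object bytes.
    """
    (count,) = struct.unpack_from("<I", data, 0)
    offsets = list(struct.unpack_from(f"<{count}I", data, 4))
    # No sizes are stored: each object ends where the next one starts,
    # and the last object is assumed to run to the end of the data.
    ends = offsets[1:] + [len(data)]
    return [data[start:end] for start, end in zip(offsets, ends)]


# Header is 4 + 3 * 4 = 16 bytes, so the first object starts at offset 16.
sample = struct.pack("<4I", 3, 16, 18, 18) + b"abxyz"
objects = read_objects(sample)
```

The sizes never appear in the data; they only come out of interpreting the header, which is the part a plain context-free grammar cannot express.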

Instead of designing domain-specific languages, I would prefer an integration into ANTLR, because there are also file formats out in the wild that combine binary data with text data -- and both need to be parsed. Having two separate programs is not elegant -- you would end up spending a lot of effort to put the binary and text parsing results into one abstract syntax tree.

In my opinion there are typical design patterns that are often used in binary formats. Offsets, as mentioned by Ron in his earlier post, are one example; another is what I described in my first post, section "Interpretation of size":

---------------------------------------------
| header | size of next block | block | ... |
---------------------------------------------

Such patterns could be represented in an expressive syntax.
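
For comparison, this is roughly what a hand-written reader for the size-prefixed pattern looks like, i.e. the kind of code a generator would have to produce from such a syntax. It is only a sketch in Python: the 4-byte header `DEMO`, the 16-bit big-endian size field, and the name `parse_blocks` are assumptions for illustration.

```python
import struct

MAGIC = b"DEMO"  # hypothetical 4-byte header


def parse_blocks(data: bytes):
    """Parse: header, then repeated (u16 big-endian size, block of that size)."""
    if data[:4] != MAGIC:
        raise ValueError("bad header")
    pos, blocks = 4, []
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated size field")
        # The value read here controls how many bytes are consumed next.
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        if pos + size > len(data):
            raise ValueError("truncated block")
        blocks.append(data[pos:pos + size])
        pos += size
    return blocks


sample = (MAGIC
          + struct.pack(">H", 3) + b"abc"
          + struct.pack(">H", 0)
          + struct.pack(">H", 2) + b"hi")
blocks = parse_blocks(sample)
```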

I think the big issue that makes binary files different from text files is their self-referential nature: To be able to read a binary file you have to partially interpret it and use this information to manage the read process. You mostly can't decouple parsing and interpretation. But in my opinion this is no reason not to add such functionality to ANTLR.

Andi