[antlr-interest] Parsing structured data rather than language

Andrew Lentvorski bsder at allcaps.org
Wed Jul 25 13:24:16 PDT 2007


Randall R Schulz wrote:

> Dare I again bring up the question of whether it's sensible to use ANTLR 
> or any similar tool to parse XML?

Ummm, I'm not parsing XML.  I'm parsing a VCD (Verilog Change Dump) file.

However, your question begets a larger question:

Is ANTLR a good way to parse and process structured data?

*That* I don't know.


Dealing with structured data sucks.  Even with XML, you wind up writing 
the equivalent of a top-down recursive descent grammar.  If you look at 
the format of a VCD file, it also requires the equivalent of a recursive 
descent grammar to soak it all up.
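
To make the recursive-descent point concrete, here is a minimal sketch in Python. It assumes the input has already been split into whitespace-separated words, and it handles only the nested $scope ... $upscope declarations and the $var lines inside them. The names parse_scopes, peek, take, and expect are illustrative helpers, not part of ANTLR or of any VCD library.

```python
def parse_scopes(tokens):
    """Parse a flat list of VCD header words into a nested scope tree.

    Illustrative only: handles "$scope kind name $end", "$var ... $end",
    and "$upscope $end"; anything else raises ValueError.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expect(word):
        got = take()
        if got != word:
            raise ValueError(f"expected {word!r}, got {got!r}")

    def parse_var():
        # $var <type> <size> <id_code> <reference...> $end
        expect("$var")
        fields = []
        while peek() != "$end":
            fields.append(take())
        expect("$end")
        if len(fields) < 4:
            raise ValueError(f"malformed $var: {fields!r}")
        return {"type": fields[0], "size": int(fields[1]),
                "id": fields[2], "ref": " ".join(fields[3:])}

    def parse_scope():
        # One function per nesting construct: the recursion mirrors the data.
        expect("$scope")
        kind, name = take(), take()
        expect("$end")
        node = {"kind": kind, "name": name, "vars": [], "children": []}
        while peek() != "$upscope":
            if peek() == "$scope":
                node["children"].append(parse_scope())
            elif peek() == "$var":
                node["vars"].append(parse_var())
            else:
                raise ValueError(f"unexpected token {peek()!r}")
        expect("$upscope")
        expect("$end")
        return node

    tree = parse_scope()
    if pos != len(tokens):
        raise ValueError(f"trailing input at token {pos}")
    return tree
```

Calling parse_scopes(text.split()) on a header fragment returns nested dictionaries, one per scope. Once the words arrive cleanly, the parser itself stays this small.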

The problem, however, is tokenization.  The token boundaries are 
context-dependent: a combination of delimiters serves to switch the 
tokenizer from one class of tokens to another.

The VCD format, in particular, sometimes uses explicit delimiters 
($comment ... $end), but it overloads the end delimiter ($date ... 
$end).  Sometimes it separates a value from its identifier with 
whitespace ("1x0z01 identifier"), but sometimes it relies on character 
length, as in "1anotherid", where the 1 is the value and anotherid is 
the identifier.
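
A rough sketch of what a handwritten tokenizer for these cases might look like, in Python. It is not a complete VCD lexer. The token kinds (TEXT, KEYWORD, WORD, TIME, VECTOR, SCALAR) and the function name tokenize are made up for illustration. It assumes that vector values carry a leading b or B, as in the VCD format defined by the Verilog standard (so the whitespace case reads "b1x0z01 identifier"), and that value changes are only recognized after $enddefinitions. Real values (r prefix) are not handled.

```python
import re

_WORD = re.compile(r"\S+")
_END = re.compile(r"(?<!\S)\$end(?!\S)")       # "$end" as a whole word
FREE_TEXT = {"$comment", "$date", "$version"}  # body is arbitrary text

def tokenize(text):
    """Yield (kind, value) pairs; the mode depends on what came before."""
    pos = 0
    in_body = False  # flips to True once $enddefinitions is seen
    while (m := _WORD.search(text, pos)) is not None:
        word, pos = m.group(), m.end()
        if word in FREE_TEXT:
            # Free-text mode: swallow everything up to the closing $end.
            end = _END.search(text, pos)
            if end is None:
                raise ValueError(f"unterminated {word}")
            yield ("TEXT", (word, text[pos:end.start()].strip()))
            pos = end.end()
        elif word.startswith("$"):
            if word == "$enddefinitions":
                in_body = True
            yield ("KEYWORD", word)
        elif not in_body:
            # Header mode: names like "bus" or "x1" are plain words here.
            yield ("WORD", word)
        elif word.startswith("#"):
            yield ("TIME", int(word[1:]))
        elif word[0] in "bB":
            # Vector: value and identifier are separated by whitespace.
            ident = _WORD.search(text, pos)
            if ident is None:
                raise ValueError(f"vector {word!r} has no identifier")
            pos = ident.end()
            yield ("VECTOR", (word[1:], ident.group()))
        elif word[0] in "01xXzZ" and len(word) > 1:
            # Scalar: first character is the value, the rest the identifier.
            yield ("SCALAR", (word[0], word[1:]))
        else:
            raise ValueError(f"unrecognized value change {word!r}")
```

The in_body flag is the whole point: the same characters ("bus", "x1") are a plain name in the header and would be a value change in the body, so the tokenizer has to know where it is before it can classify anything.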

The problem is that I don't get to *write* these interchange formats. 
I'm stuck with them.  I have to beat the tokenizer into submission. 
Once I get the tokenizer to behave, normally the grammar is straightforward.

Maybe this means that I should just pull the grammar out of ANTLR and 
use a handwritten custom tokenizer.  I don't know.  However, I'd at 
least like to try this while staying completely within the framework of 
ANTLR.  Afterward, I can try the custom tokenizer and see if it reduces 
the complexity substantially or not.

-a

