[antlr-interest] Parsing structured data rather than language
Andrew Lentvorski
bsder at allcaps.org
Wed Jul 25 13:24:16 PDT 2007
Randall R Schulz wrote:
> Dare I again bring up the question of whether it's sensible to use ANTLR
> or any similar tool to parse XML?
Ummm, I'm not parsing XML. I'm parsing a VCD (Verilog Value Change Dump) file.
However, your question begets a larger question:
Is ANTLR a good way to parse and process structured data?
*That* I don't know.
Dealing with structured data sucks. Even with XML, you wind up writing
the equivalent of a top-down recursive descent parser. If you look at
the format of a VCD file, it also requires the equivalent of a recursive
descent parser to soak it all up.
The problem, however, is tokenization. The token rules are
context-dependent: a combination of delimiters signals when the
tokenizer has to switch from one class of tokens to another.
The VCD format, in particular, sometimes uses explicit delimiters
($comment ... $end), but it overloads the end delimiter, so the same
$end also closes other sections ($date ... $end). Sometimes it
separates a value from its identifier with whitespace ("1x0z01
identifier"), but sometimes it relies on character length instead: in
"1anotherid", the 1 is the value and anotherid is the identifier.
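
To make the tokenization problem concrete, here is a minimal sketch of a
handwritten tokenizer in Python. It is an illustration under stated
assumptions, not a complete VCD reader: the function name tokenize_vcd
and the token kinds are hypothetical, every $keyword section is kept as
raw text up to the next $end (so value changes nested in a section such
as $dumpvars are not decoded), and vector values are assumed to carry
the "b" prefix that standard VCD files use.

```python
def tokenize_vcd(text):
    """Yield (kind, value) tuples from a VCD-like string.

    Hypothetical sketch: section bodies are returned as raw text, and
    only the value-change forms discussed above are decoded.
    """
    words = iter(text.split())
    for word in words:
        if word.startswith("$") and word != "$end":
            # Explicitly delimited section: swallow everything up to the
            # shared $end delimiter, whatever the opening keyword was.
            body = []
            for inner in words:
                if inner == "$end":
                    break
                body.append(inner)
            yield (word[1:].upper(), " ".join(body))
        elif word[0] in "bB":
            # Vector change: the value and the identifier are separated
            # by whitespace, so the identifier is the next word.
            ident = next(words, "")
            yield ("VECTOR", (word[1:], ident))
        elif word[0] in "01xXzZ":
            # Scalar change: the value is exactly one character long and
            # the identifier follows with no separator.
            yield ("SCALAR", (word[0], word[1:]))
        elif word.startswith("#"):
            # Simulation timestamp.
            yield ("TIME", int(word[1:]))
        else:
            yield ("WORD", word)


sample = "$date Wed Jul 25 $end #10 1anotherid b1x0z01 identifier"
tokens = list(tokenize_vcd(sample))
```

The point of the sketch is that the first character of a word decides
how the rest of it is split, which is exactly the kind of decision a
conventional longest-match lexer rule does not express naturally.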
The problem is that I don't get to *write* these interchange formats.
I'm stuck with them. I have to beat the tokenizer into submission.
Once I get the tokenizer to behave, the grammar is normally straightforward.
Maybe this means that I should just pull the grammar out of ANTLR and
use a handwritten custom tokenizer. I don't know. However, I'd at
least like to try this while staying completely within the framework of
ANTLR. Afterward, I can try the custom tokenizer and see if it reduces
the complexity substantially or not.
-a