[antlr-interest] Retaining comments

Stuart Watt SWatt at infobal.com
Wed Mar 12 07:46:27 PDT 2008


Yep, I see this, and I'm certainly no fan of XML from a user's perspective.
I agree 110% with the rant. XML is not written for humans, nor should people
have to read or write it. I have ranted in much the same way against the
semantic web community myself (they seem to imagine people can write not
only XML, but RDF!!). But as a way of annotating text flexibly, I can't see
much wrong with it. And I can't see why a parse tree of some kind (ideally
with lexing information) cannot at least be overlaid on the text in this
manner. I suppose the question is: do you need to maintain the association
with the underlying text? If not, an AST is fine. If so, why not use a
flexible text markup language? The comment issue seems (to me) to fit into
the category of maintaining association with the underlying text. And it is
not that hard - as far as I can tell, ANTLR could be made to do it just fine
in a couple of hundred lines of code, in the tree writer.
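
To make that concrete, here is the sort of overlay I have in mind - a
minimal sketch, with made-up element names, of a fragment of C marked up
with parse information while the original text (including the comment)
survives verbatim in the character data:

    <statement line="12">
      <call name="printf">printf</call>(<string>"hello"</string>);
      <comment>/* greet the user */</comment>
    </statement>

Strip the tags and you get the original text back, byte for byte; keep
them and a downstream tool has the tree.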
 
The performance claim doesn't match what I've been getting. I can slurp in
immense amounts of XML fast using libxml2 - subjectively, this is certainly
quicker than parsing with ANTLR and the C runtime. However, I did choose
libxml2 carefully (I led a project in the INEX XML information retrieval
competition and needed something capable of fast, large-scale XML
processing, as we were indexing the whole of Wikipedia). I've not
benchmarked it, but my C parser feels about 5 times slower than reading the
XML I generate from its back end.
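
For what it's worth, the kind of reading loop I mean is libxml2's
xmlReader (pull) interface - a minimal sketch, assuming a well-formed
file on disk; real code would inspect node names and types rather than
just counting (build with xml2-config --cflags --libs):

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/xmlreader.h>

    int main(int argc, char **argv)
    {
        xmlTextReaderPtr reader = xmlReaderForFile(argv[1], NULL, 0);
        long nodes = 0;
        int ret;
        if (reader == NULL)
            return 1;
        /* stream over the document one node at a time - no DOM is built */
        while ((ret = xmlTextReaderRead(reader)) == 1)
            nodes++;
        xmlFreeTextReader(reader);
        xmlCleanupParser();
        printf("%ld nodes\n", nodes);
        return ret != 0;  /* ret is 0 at clean EOF, -1 on error */
    }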
 
It is the "horizontal problems" I'm actually working on: trying to
integrate information from multiple interlocking source languages, many of
which are (a) sloppily defined (C, C++, COBOL, etc.) and (b) embed each
other in all sorts of interesting ways. This includes not only the obvious
cases (<script> in HTML) but far less obvious ones - C strings that happen
to be SQL queries. It seems as if everything can be delegated (the obvious
solution), but this raises the interesting challenge of how to make sense
of stuff that comes from multiple source languages, even within a single
processing unit. However, if a group of people have written fast code for
searching trees for patterns (not walking the tree, but searching it), that
is useful technology to me. As an ex-IR person, maybe my roots in the XML
community are making me see things differently.
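
By "searching" I mean something like an XPath query over the annotated
tree - a sketch using libxml2's XPath API, assuming the hypothetical
overlay above where string literals are <string> elements, to pull out
candidates for embedded SQL:

    #include <libxml/parser.h>
    #include <libxml/xpath.h>

    /* find every string literal node in the overlay document;
       a real pattern would filter for SQL-looking content */
    xmlXPathObjectPtr find_strings(xmlDocPtr doc)
    {
        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        xmlXPathObjectPtr result =
            xmlXPathEvalExpression((const xmlChar *)"//string", ctx);
        xmlXPathFreeContext(ctx);
        return result;  /* caller frees with xmlXPathFreeObject() */
    }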
 
It may be that the task I am trying to solve involves a weak version of a
grammar. Quite simply, I do not need to parse everything. I do need to
tokenise everything, in some form, but I rarely need a complete tree. With
some languages (C, C++, COBOL, etc.) parsing is a big enough challenge on
its own that I'd rather not bother. For the processing I am doing, I do not
care about control structures, for example. The C parser I developed
(successfully, in ANTLR) is skeletal to the point of not classifying most
statements, but it does track nesting of parentheses, braces, etc.
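
To give a flavour of what "skeletal" means here - a sketch in ANTLR v3
notation, not my actual grammar: a single recursive lexer rule balances
the braces, and everything inside is swallowed without classification:

    // matches a balanced {...} region, recursing for nested blocks
    BLOCK : '{' ( BLOCK | ~('{' | '}') )* '}' ;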
 
XML does not have to mean making huge files out of things. SAX is a
perfectly good way of producing output and, being stream/event oriented, is
great for filtering tasks. The design I've been playing with is a generic,
event-stream-oriented approach to parsing. That way I can (and do) push
gigabytes of text through the parsing systems without ever writing it out
to files.
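
Concretely, instead of serialising XML to a file, the parser's back end
can call SAX-ish callbacks directly - a sketch, with invented names (not a
real ANTLR or libxml2 interface), of the sort of event sink I mean; a
filter just implements the three functions and passes events along:

    /* hypothetical event sink for parser output */
    typedef struct {
        void (*start_element)(void *ctx, const char *name);
        void (*characters)(void *ctx, const char *text, int len);
        void (*end_element)(void *ctx, const char *name);
        void *ctx;
    } event_sink;

    /* the parser emits events as it recognises constructs; nothing
       touches the disk unless a particular sink chooses to write */
    static void emit_comment(event_sink *s, const char *text, int len)
    {
        s->start_element(s->ctx, "comment");
        s->characters(s->ctx, text, len);
        s->end_element(s->ctx, "comment");
    }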
 
Consider the following scenario. Imagine I want to annotate C into HTML,
turning function calls into hyperlinks, and maybe a few other things, such
as colorising comments. That is two tasks: recognising function calls, and
annotating the original text. To do this, the original text needs to be
preserved, and the text parsed in some way. This feels like a non-trivial
task, but surely it is not that unusual. It seems to me that an XML parse
tree annotation on the original text would make this trivial, for what
ought to be a modest addition to ANTLR. Obviously, if you felt strongly,
you could do it directly in the parser, but there are good (separation of
concerns) reasons for not tangling the styling details into it, and XML
would allow this kind of separation.
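
With the overlay in place, the HTML annotation really is just a
stylesheet - a sketch in XSLT 1.0 (fragments of a larger stylesheet),
again assuming the hypothetical <call> and <comment> elements from the
overlay above:

    <xsl:template match="call">
      <a href="#{@name}"><xsl:apply-templates/></a>
    </xsl:template>

    <xsl:template match="comment">
      <span class="comment"><xsl:apply-templates/></span>
    </xsl:template>

The parser knows nothing about styling, and the stylesheet knows nothing
about parsing - which is exactly the separation I want.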
 
All the best
Stuart

-----Original Message-----
From: Loring Craymer [mailto:lgcraymer at yahoo.com]
Sent: Wednesday, March 12, 2008 12:42 AM
To: Stuart Watt; Terence Parr; bmeike at speakeasy.net
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Retaining comments


You can do XML and DOM--ANTLR 2 had an AST serializer built in--but there is
not much point to doing so other than that you have some familiarity with
the tools.  For any vertical translation problem (one language to
translate), ANTLR will be faster (XML processing is _slow_ from a machine
perspective), more powerful, and easier to use if you learn how to use ANTLR
effectively.  There are horizontal problems--extracting information from a
collection of trees generated by different source languages and different
translators--for which XML is usable, but again this is not the way to go if
you are comfortable with language processing technology.

The value of XML is that it is an agreed-upon format for structured text
that is portable and can be adapted for general information retrieval ("the
semantic web")--or at least has that as a hoped-for goal.  It is not a
technology for language processing; indeed, the XML community seems to be
almost allergic to language processing technology.  "Everything is a tree"
does not remove the need for grammars--the XML community calls them "schema"
and writes applications in XSLT to convert from one schema to another
without intermediate analysis.

You might also take a look at Ter's rant on XML:
http://www.ibm.com/developerworks/xml/library/x-sbxml.html

--Loring



 
