[antlr-interest] Retaining comments

Jim Idle jimi at temporal-wave.com
Wed Mar 12 09:12:46 PDT 2008


I would need to see your code to comment on why you feel that your XML generator is 5X slower than libxml, but do bear in mind the following:

 

1)      ANTLR recognizes the things you specify via a grammar, whereas the XML library is hand-crafted to go straight at a known internal structure; it is likely using things like memchr to locate the next spot in the input it cares about, and so on. This isn't a defense of the C runtime, by the way: it should be obvious that when you know exactly what the input looks like, you can hand-craft something that will be faster than a generic recognizer.

2)      You are generating the XML after reading in, lexing, and parsing ;-) Hence it isn't quite a fair comparison unless you get libxml2 to do something equivalent :-)

3)      You might try the optimized scanning in the C runtime if you know in advance that the input encoding is 8 or 16 bits (I will provide a UTF-32 version before too long). This skips calling a function to find the next input character and instead maintains a pointer into the input stream. See the Java parsing example in the downloads.

4)      If you are using the built-in string stuff, note that you don't have to. It is generic, so when you ask for the text it allocates memory and so on. If you are in fact building up a long output string, you could easily write a function that, given a pANTLR3_COMMON_TOKEN, memcpy's the input text directly to where you want it to go (see the sketch after this list). This is a big advantage of ANTLR 3: you are not forced into incurring penalties you do not need. Otherwise you are probably asking for the $text of a token, which copies it out of the input stream and then appends it to some buffer; there is no need for the intermediate step.

5)      Make sure you are not using expensive predicates or backtracking, of course.

6)      If you want me to take a look, feel free to send me the code off list.
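
To make point 4 concrete, here is a minimal sketch of what I mean. It assumes the 8-bit input stream, where a token's start and stop markers are addresses straight into the input buffer, and that you have already sized the destination buffer yourself; treat the accessor names as the usual pANTLR3_COMMON_TOKEN ones and adjust to taste:

    #include <string.h>
    #include <antlr3.h>

    /* Sketch only: copy a token's text straight out of the input buffer into
     * an output buffer you manage yourself, bypassing the generic
     * pANTLR3_STRING machinery. Assumes the 8-bit input stream, where the
     * start/stop markers are addresses into the original input.
     */
    static char *
    appendTokenText(char *dest, pANTLR3_COMMON_TOKEN tok)
    {
        ANTLR3_MARKER start = tok->getStartIndex(tok);
        ANTLR3_MARKER stop  = tok->getStopIndex(tok);
        size_t        len   = (size_t)(stop - start + 1);

        memcpy(dest, (const void *)start, len);  /* straight copy, no intermediate string */
        return dest + len;                       /* next free position in the output      */
    }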

 

So, in short, I am sure that you can improve your performance quite a lot (a profiler will probably help you here), but you probably can't compete with something that isn't really tokenizing anything, provided it is well written. In my own experience, Loring is correct in his assessment of XML processors in general, though that doesn't mean there are no good ones around ;-)

 

My other thought here is that to accurately gauge a 5X performance difference by 'gut feel' alone, you would have to be parsing a pretty huge input stream. My own parsers don't take long enough for me to measure anything with a single input; in human terms the parse is 'done' almost as soon as it is called. Are you by any chance traversing an input directory, where you must read each file from the filesystem, but then calling libxml2 with an in-memory string? You might try breaking out the time for each component; a rough sketch follows this paragraph. For instance, when testing the C runtime against the Java runtime, I was parsing the whole of JDK 1.6. In the end, the time taken to get the files off disk exceeded the parsing time, and until I found a way to remove this bottleneck, the C parser only looked about 50% faster, because for both Java and C most of the time was spent reading from the file system.
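
As a rough illustration of what I mean by breaking out the time, something like this is enough (read_whole_file() and parse_buffer() here are hypothetical stand-ins for your own I/O and lex/parse code):

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical helpers standing in for your own code. */
    extern char *read_whole_file(const char *path, size_t *len); /* disk I/O only  */
    extern void  parse_buffer(const char *buf, size_t len);      /* lex/parse only */

    void timeOnePass(const char *path)
    {
        size_t  len;
        clock_t t0  = clock();
        char   *buf = read_whole_file(path, &len);
        clock_t t1  = clock();

        parse_buffer(buf, len);
        clock_t t2  = clock();

        printf("%s: read %.3fs, parse %.3fs\n", path,
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
    }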

 

I will point out that while associating comments with code seems to be an easy task, you soon find that it is almost impossible to get all the associations correct unless you start out with some pre-canned rules about what you are going to see, which is how doxygen works. There is a lot of information on this stuff elsewhere, of course.

 

Anyway, I am happy to hear you have been successful building your parser/toolset!

 

Jim

 

From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Stuart Watt
Sent: Wednesday, March 12, 2008 7:46 AM
To: 'Loring Craymer'; Stuart Watt; Terence Parr; bmeike at speakeasy.net
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Retaining comments

 

Yep, I see this, and I'm certainly no fan of XML from a user's perspective. I agree 110% with the rant. XML is not written for humans, nor should people have to read or write it. I have ranted myself against the semantic web community in much the same way (they seem to imagine people can write not only XML, but RDF!!). But as a way of annotating text flexibly, I can't see much wrong with it. And I can't see why a parse tree of some kind (ideally with lexing information), at the least, cannot be overlaid on the text in this manner. I suppose the question is: do you need to maintain the association with the underlying text? If not, an AST is fine. If so, why not use a flexible text markup language? The comment issue seems (to me) to fit into the category of maintaining association with the underlying text. And it is not that hard - as far as I can tell, ANTLR could be made to do it just fine in a couple of hundred lines of code, in the tree writer.

 

The performance doesn't seem to match what I've been getting. I can slurp in immense amounts of XML fast using libxml2 - this is certainly (subjectively) quicker than parsing with ANTLR and the C runtime. However, I did choose libxml2 carefully (I led a project on the INEX XML information retrieval competition and needed something capable of fast, large-scale XML processing, as we were indexing the whole of Wikipedia). I've not benchmarked it, but my C parser feels about 5 times slower than reading the XML I generate from its back end.

 

It is the "horizontal problems" I'm actually working on: trying to integrate information from multiple interlocking source languages, many of which are (a) sloppily defined (C, C++, COBOL, etc.) and (b) embed each other in all sorts of interesting ways. This includes not only the obvious cases (<script> in HTML) but far less obvious ones - C strings that happen to be SQL queries. It is as if everything can be delegated (the obvious solution), but this raises the interesting challenge of how to make sense of material that comes from multiple source languages, even within a single processing unit. However, if a group of people have written fast code for searching trees for patterns (this is not walking the tree, but searching it), that is useful technology to me. As an ex-IR person, maybe my roots in the XML community are making me see things differently.

 

It may be that the task I am trying to solve involves a weak version of a grammar. Quite simply, I do not need to parse everything. I do need to tokenise everything, in some form, but I rarely need a complete tree. With some languages (C, C++, COBOL, etc.) parsing is a big enough challenge on its own that I'd rather not bother. For the processing I am doing, I do not care about control structures, for example. The C parser I developed (successfully, in ANTLR) is skeletal to the point of not classifying most statements, but it does track nesting of parentheses, braces, etc.

 

XML does not equate to making huge files out of things. SAX is a perfectly good way of producing output and, being stream/event oriented, is great for filtering tasks. The design I've been playing with is a generic, event-stream-oriented approach to parsing. That way I can (and do) push gigabytes of text through the parsing systems without ever writing them out to files.
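
To show what I mean by event-oriented, here is a minimal sketch of the libxml2 SAX side; the callback bodies are just placeholders for whatever filtering or output you want:

    #include <string.h>
    #include <libxml/parser.h>

    /* Sketch: consume XML as a stream of events, never building a tree. */
    static void onStartElement(void *ctx, const xmlChar *name, const xmlChar **attrs)
    {
        /* an element has opened; attrs is a NULL-terminated name/value array */
    }

    static void onCharacters(void *ctx, const xmlChar *ch, int len)
    {
        /* character data arrives in chunks as it is seen */
    }

    int parseInMemory(const char *buf, int size)
    {
        xmlSAXHandler sax;
        memset(&sax, 0, sizeof(sax));
        sax.startElement = onStartElement;
        sax.characters   = onCharacters;
        return xmlSAXUserParseMemory(&sax, NULL, buf, size);  /* 0 on success */
    }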

 

Consider the following scenario. Imagine I want to annotate C into HTML, turning function calls into hyperlinks and perhaps a few other things, such as colorising comments. There are two tasks: recognising the function calls, and annotating the original text. To do this, the original text needs to be preserved and also parsed in some way. This feels like a non-trivial task, but surely it is not that unusual. It seems to me that an XML parse-tree annotation over the original text would make this trivial, for what ought to be a modest addition to ANTLR. Obviously, if you felt strongly, you could do it directly, but there are good (separation of concerns) reasons for not tangling the styling details into the parser, and XML would allow this kind of separation.
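
As a sketch of what I have in mind - the offsets would come straight from the lexer's token start/stop positions, and emitSpan is purely a hypothetical helper:

    #include <stdio.h>

    /* Sketch: wrap one recognized region of the original text in an HTML span,
     * copying everything before it verbatim. 'pos' is how far we have emitted
     * so far; 'start'/'stop' are character offsets taken from the token. */
    static long emitSpan(FILE *out, const char *src, long pos,
                         long start, long stop, const char *cssClass)
    {
        fwrite(src + pos, 1, (size_t)(start - pos), out);        /* untouched text   */
        fprintf(out, "<span class=\"%s\">", cssClass);
        fwrite(src + start, 1, (size_t)(stop - start + 1), out); /* the token itself */
        fputs("</span>", out);
        return stop + 1;                                         /* new emit position */
    }

Called once per comment or function-call token, in offset order, with the tail of the file flushed at the end, that is pretty much the whole annotator; the styling itself lives entirely in a stylesheet.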

 

All the best

Stuart

-----Original Message-----
From: Loring Craymer [mailto:lgcraymer at yahoo.com]
Sent: Wednesday, March 12, 2008 12:42 AM
To: Stuart Watt; Terence Parr; bmeike at speakeasy.net
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Retaining comments

You can do XML and DOM--ANTLR 2 had an AST serializer built in--but there is not much point to doing so other than that you have some familiarity with the tools.  For any vertical translation problem (one language to translate), ANTLR will be faster (XML processing is _slow_ from a machine perspective), more powerful, and easier to use if you learn how to use ANTLR effectively.  There are horizontal problems--extracting information from a collection of trees generated by different source languages and different translators--for which XML is usable, but again this is not the way to go if you are comfortable with language processing technology.

The value of XML is that it is an agreed-upon format for structured text that is portable and can be adapted for general information retrieval ("the semantic web")--or at least has that as a hoped-for goal.  It is not a technology for language processing; indeed, the XML community seems to be almost allergic to language processing technology.  "Everything is a tree" does not remove the need for grammars--the XML community calls them "schema" and writes applications in XSLT to convert from one schema to another without intermediate analysis.

You might also take a look at Ter's rant on XML, http://www.ibm.com/developerworks/xml/library/x-sbxml.html.

--Loring

 


