[antlr-interest] Retaining comments

Tue Mar 11 22:41:53 PDT 2008

You can do XML and DOM--ANTLR 2 had an AST serializer built in--but there is not much point in doing so unless you already have some familiarity with those tools.  For any vertical translation problem (one language to translate), ANTLR will be faster (XML processing is _slow_ from a machine perspective), more powerful, and easier to use if you learn how to use ANTLR effectively.  There are horizontal problems--extracting information from a collection of trees generated by different source languages and different translators--for which XML is usable, but again this is not the way to go if you are comfortable with language processing technology.

The value of XML is that it is an agreed-upon format for structured text that is portable and can be adapted for general information retrieval ("the semantic web")--or at least has that as a hoped-for goal.  It is not a technology for language processing; indeed, the XML community seems to be almost allergic to language processing technology.  "Everything is a tree" does not remove the need for grammars--the XML community calls them "schema" and writes applications in XSLT to convert from one schema to another without intermediate analysis.

You might also take a look at Ter's rant on XML, http://www.ibm.com/developerworks/xml/library/x-sbxml.html.

--Loring

----- Original Message ----
From: Stuart Watt <SWatt at infobal.com>
To: Terence Parr <parrt at cs.usfca.edu>; bmeike at speakeasy.net
Cc: antlr-interest at antlr.org
Sent: Tuesday, March 11, 2008 12:45:47 PM
Subject: Re: [antlr-interest] Retaining comments

OK, I'm going to have to do this as well. However, my dream would be....

Can we use/generate an XML AST, with the text nodes corresponding exactly to
the input source received at the lexer, and the elements corresponding to
the AST tags? I know there are all sorts of complexities with this, but it
enables several outcomes:

1. Using fast and general tree processing via XML and DOM, maybe even using
   XPath and XQuery
2. Easy filtering via the above
3. Clear mapping between AST and text, which is not currently easy
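
[Illustration] The three outcomes above can be sketched with Python's standard xml.etree.ElementTree, which supports only a limited XPath subset. The XML_AST string, its element names, and the sample source line are hypothetical; nothing here is produced by ANTLR.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML AST for the source text "x = 1 + 2; // sum\n".
# Element names stand in for AST node types; the text nodes carry the
# exact characters the lexer would have received.
XML_AST = (
    "<assign><id>x</id> = <add><int>1</int> + <int>2</int></add>;"
    " <comment>// sum</comment>\n</assign>"
)

root = ET.fromstring(XML_AST)

# Outcomes 1 and 2: general tree processing and filtering by path.
int_literals = [e.text for e in root.findall(".//int")]
comments = [e.text for e in root.findall(".//comment")]

# Outcome 3: joining every text node in document order gives back the input.
source = "".join(root.itertext())
```

Because whitespace and punctuation live in the text nodes rather than in elements, the question of whether whitespace falls inside or outside a given element still has to be decided by whoever writes the tree out.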

Although I have not completely looked into this yet (and I will) it seems 
most of this could be done fine using an additional AST writer. I wrote one 
which does the XML, but does not preserve the input text. In the end, I had to 
do this, as the current AST notation (which I wanted to read for processing) was 
unable to distinguish, say, between an imaginary token "FUNCTION" and a 
language identifier written as uppercase "FUNCTION", unless I tagged absolutely 
every single thing in the grammar, which was tedious. There are all sorts of 
other nasty cases (e.g., does whitespace fall inside or outside of particular 
elements). And in particular, this would require some mapping between imaginary 
tokens and text positions which is not always possible. 
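
[Illustration] One way such a writer might keep an imaginary "FUNCTION" node apart from an identifier spelled "FUNCTION" is to flag imaginary nodes with an attribute. This is a minimal sketch under assumptions: the Node class, the to_xml function, and the "imaginary" attribute are all hypothetical and are not ANTLR's tree classes or API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical AST node, not ANTLR's tree class."""
    type_name: str
    text: Optional[str] = None  # None means the node has no source text
    children: List["Node"] = field(default_factory=list)


def to_xml(node: Node) -> ET.Element:
    """Convert a Node to an XML element, flagging imaginary nodes."""
    elem = ET.Element(node.type_name)
    if node.text is None:
        # Imaginary token: it exists only in the tree, not in the input.
        elem.set("imaginary", "true")
    else:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child))
    return elem


# An imaginary FUNCTION node whose child is an identifier named "FUNCTION".
tree = Node("FUNCTION", None, [Node("ID", "FUNCTION"), Node("BODY", None)])
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
```

The sketch sidesteps the harder problem mentioned above: it does not assign text positions to imaginary nodes.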

I'm developing a system which will annotate code, both generating
human-readable output and a component index. The one pushes you to a text
output, the other to an AST - I've ended up needing both, largely because of
similar issues. It seems it may be fairly simple to develop this kind of
tree writer for cases like these.

Any thoughts on this? Am I crazy/doing it all wrong?

--S
-----Original Message-----
From: Terence Parr [mailto:parrt at cs.usfca.edu]
Sent: Tuesday, March 11, 2008 12:43 PM
To: bmeike at speakeasy.net
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Retaining comments

Send comments to the parser on a different channel, then look in the token buffer for them between "real" tokens.  Ter
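
[Illustration] The idea can be sketched without ANTLR itself: keep every token in one buffer, tag comments with a non-default channel, and scan backwards from a "real" token to collect the off-channel tokens in front of it. The Token class, the channel numbers, and hidden_before are assumptions made for this sketch, not ANTLR's API.

```python
from dataclasses import dataclass
from typing import List

DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 99  # assumption: any non-default channel number would do


@dataclass
class Token:
    """Hypothetical token record, not ANTLR's Token class."""
    type_name: str
    text: str
    channel: int = DEFAULT_CHANNEL


def hidden_before(tokens: List[Token], index: int) -> List[Token]:
    """Return the off-channel tokens directly preceding tokens[index]."""
    found = []
    i = index - 1
    # Walk back until the previous on-channel ("real") token is reached.
    while i >= 0 and tokens[i].channel != DEFAULT_CHANNEL:
        found.append(tokens[i])
        i -= 1
    found.reverse()  # restore source order
    return found


# The full buffer, including tokens the parser never sees.
tokens = [
    Token("ID", "x"),
    Token("COMMENT", "/* note */", HIDDEN_CHANNEL),
    Token("WS", " ", HIDDEN_CHANNEL),
    Token("ASSIGN", "="),
]

# Comments sitting between "x" and "=" in the buffer.
comments = [t.text for t in hidden_before(tokens, 3) if t.type_name == "COMMENT"]
```

A parser rule that has just matched a token would pass that token's buffer index to a lookup like this and emit whatever comments it returns.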
On Feb 27, 2008, at 1:19 PM, <bmeike at speakeasy.net> wrote:

On Wed Feb 27 12:29, Gavin Lambert sent:
> This will keep the comment tokens in the token stream at the
> appropriate points. To transfer them you'll have to add some code
> that looks for comment tokens nearby recognised parser constructs
> so you can emit them at the right place in the output.

Sounds great.  What do you mean by "looks for comment tokens"?  As far as I can tell, the parser only sees the DEFAULT channel.  Where do I look to find nearby tokens?

Thanks!
Blake Meike
