[antlr-interest] Nested parsing

Fri Mar 12 14:45:25 PST 2004

In the past, I've successfully used multiple lexers with a single
ANTLR parser to parse documents (as in the javadoc comment example
that comes with ANTLR).

Now I am trying to figure out how best to design multilevel (nested)
parsers while keeping the location information needed for error
reporting, etc.  I have several cases to handle, but I will give just
one example for now:

Let's say I am given the task of parsing an XML document (supposedly
conforming to a schema) whose embedded text I also need to parse,
e.g. (heavily simplified):

<document>
<turn>
Hi <pause/> there.  Say something?
</turn>
<turn>
One.  Two.  Three.
</turn>
</document>

My task is to parse, do some rearrangement and computation, and spit
out a new XML document, conforming to a different schema, e.g:

<d>
<t>
<u><w>Hi</w><p/><w>there</w><period/></u>
<u><w>Say</w><w>something</w><question/></u>
</t>
<t>
<u><w>One</w><period/></u>
<u><w>Two</w><period/></u>
<u><w>Three</w><period/></u>
</t>
</d>

Currently, I'm doing this task as follows:
- Parse XML, with validation, using Sun JAXB data binding
- Walk the tree, doing stuff, and each time I hit a "turn", manually
  concatenate its text and also invent special tokens (corresponding
  to stuff like "pause")
- Pass the text to an ANTLR-generated parser which turns it into JAXB
  tree fragments
- Serialize the tree into an XML file

One problem is that slurping XML into a tree does not create locator
information, so ANTLR parsing error messages are hard to correlate
back to the original XML file.
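
To make the correlation problem concrete, here is a minimal sketch of
the bookkeeping it would take, assuming each text chunk's starting
line and column in the XML file were known somehow, and assuming the
nested parser's error position can be reduced to a character offset in
the concatenated text.  SourceMap and its methods are hypothetical
names, not part of ANTLR or JAXB.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: maps offsets in the concatenated text back to
// line/column positions in the original XML file.
final class SourceMap {
    private final StringBuilder buffer = new StringBuilder();
    // chunk start offset in buffer -> {line, column} of the chunk's first char
    private final TreeMap<Integer, int[]> starts = new TreeMap<Integer, int[]>();

    /** Append a chunk whose first character sits at (line, column) in the file. */
    void append(String chunk, int line, int column) {
        if (chunk.length() == 0) {
            return; // nothing to map
        }
        starts.put(buffer.length(), new int[] {line, column});
        buffer.append(chunk);
    }

    /** The concatenated text to hand to the nested parser. */
    String text() {
        return buffer.toString();
    }

    /** Translate a buffer offset into {line, column} in the original file. */
    int[] locate(int offset) {
        Map.Entry<Integer, int[]> entry = starts.floorEntry(offset);
        if (entry == null || offset > buffer.length()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        int line = entry.getValue()[0];
        int column = entry.getValue()[1];
        // Walk from the chunk start to the offset, tracking newlines.
        // Assumption: the chunk text matches the file character for
        // character (no entity references such as &amp; were expanded).
        for (int i = entry.getKey(); i < offset; i++) {
            if (buffer.charAt(i) == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
        return new int[] {line, column};
    }
}
```

With something like this, an error at offset N in the text passed to
ANTLR could be reported as locate(N) in the XML file.  The missing
piece is where the per-chunk positions come from.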

I was thinking of replacing use of JAXB for parsing in the first step,
using SAX instead, and getting location information that way, passing
it to the ANTLR parser.  But SAX is brutally low level and I lose the
advantage of just automatically getting a tree.
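
For reference, a rough sketch of what the SAX route might look like,
using only the standard org.xml.sax API.  LocatedText and Piece are
hypothetical names, and the "<pause>" marker string is just a stand-in
for whatever special token the ANTLR lexer would expect.  Note that a
SAX Locator reports where an event ends, not where it starts, so the
start of a text chunk would have to be derived from the end of the
previous event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of SAX, JAXB, or ANTLR.
final class LocatedText {
    /** A text chunk or an element marker, with the SAX-reported position. */
    static final class Piece {
        final String text;
        final int line;   // line where the SAX event ENDS
        final int column; // column where the SAX event ENDS
        Piece(String text, int line, int column) {
            this.text = text;
            this.line = line;
            this.column = column;
        }
    }

    /** Collect non-blank text and pause markers together with their positions. */
    static List<Piece> collect(String xml) {
        final List<Piece> pieces = new ArrayList<Piece>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator; // supplied by the parser, may stay null

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("pause".equals(qName)) {
                    add("<pause>"); // invented token for the nested parser
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one run of text in several calls.
                String s = new String(ch, start, length);
                if (s.trim().length() > 0) {
                    add(s);
                }
            }

            private void add(String text) {
                int line = locator == null ? -1 : locator.getLineNumber();
                int column = locator == null ? -1 : locator.getColumnNumber();
                pieces.add(new Piece(text, line, column));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return pieces;
    }
}
```

Calling LocatedText.collect(xml) on the sample document would yield
the text of each turn in pieces, each tagged with a line and column.
This covers the location side, but everything JAXB gave me for free
(the tree, the binding to the schema) would have to be rebuilt by hand.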

Another possibility is to use XTA for ANTLR.  This seems to require
hand-translating the schema into an ANTLR grammar.  In addition, I'm
not sure from the documentation how XTA would interact with an attempt
to nest a parser within it (and share location information).  The
samples use the "PCDATA" token to slurp up text content of XML.

Any ideas on an elegant architecture for my transformation task?

-- 
Franklin
