[antlr-interest] Multi-phase tree rewriting question

Dukie Banderjee dukie_banderjee at hotmail.com
Fri Jun 19 09:34:29 PDT 2009


Hi,

I'm considering how to solve the following problem: The documents I'm parsing (EDI) are actually wrapped in 'envelopes'. Similar to XML, the envelopes have a beginning line and an ending line. The lines in between are the actual documents which are wrapped.

However, different wrapped documents have different formats, so my current idea is to have a grammar for each document format, and an 'outer' grammar for the enveloping structure.

The envelope grammar knows about its beginning lines and ending lines, and the fact that there are lines in between, but it does not parse the contained lines into a hierarchical AST; that job is for the specific document parser.

So, the idea is that I first parse a file into an envelope AST which has this basic structure:

^(ENVELOPE ...
  ^(GROUP ...
    ^(DOCUMENT type
      ^(LINE ...)  // Straight list of lines
      ^(LINE ...)
      ...
    )
    ...
  )
  ...
)

Then the next phase would be to rewrite the DOCUMENT AST to add some structure to it, based on the EDI standards:

    ^(DOCUMENT type
      ^(LINE ...)  // Straight list of lines
      ^(LINE ...)
      ...
    )


becomes:

    ^(DOCUMENT type
      ^(HEADER_INFO ...)  // More-detailed AST
      ^(ITEMS
        ^(ITEM
          ^(DETAIL ...)   // Hierarchical structure added
          ^(DETAIL ...)
          ...
        )
        ...
      )
      ...
    )
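
A rough sketch of what such an in-place restructuring could look like, written in plain Java so that it runs without the ANTLR runtime. Everything here is an assumption made for illustration: the Node class stands in for an ANTLR tree node, and the segment tags HDR, ITM and DTL are invented placeholders, not real EDI segment names. The point is only that the flat LINE children can be regrouped and swapped into the existing DOCUMENT node without copying the rest of the envelope tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal tree node; a real ANTLR 3 program would use its own tree type.
class Node {
    final String label;
    final List<Node> children = new ArrayList<>();

    Node(String label, Node... kids) {
        this.label = label;
        children.addAll(Arrays.asList(kids));
    }

    // LISP-style rendering, e.g. (DOCUMENT type (LINE ...))
    String toStringTree() {
        if (children.isEmpty()) return label;
        StringBuilder sb = new StringBuilder("(").append(label);
        for (Node c : children) sb.append(' ').append(c.toStringTree());
        return sb.append(')').toString();
    }
}

class DocumentRestructurer {
    // Assumed convention: the first child of each LINE is its segment tag,
    // and the remaining children are that line's fields.
    static void restructure(Node document) {
        Node header = new Node("HEADER_INFO");
        Node items = new Node("ITEMS");
        Node currentItem = null;
        List<Node> rebuilt = new ArrayList<>();

        for (Node child : document.children) {
            if (!child.label.equals("LINE")) {
                rebuilt.add(child);              // keep non-LINE children such as the type
                continue;
            }
            String tag = child.children.get(0).label;
            List<Node> fields = child.children.subList(1, child.children.size());
            switch (tag) {
                case "HDR":                      // header lines are merged into HEADER_INFO
                    header.children.addAll(fields);
                    break;
                case "ITM":                      // each ITM line opens a new ITEM group
                    currentItem = new Node("ITEM");
                    items.children.add(currentItem);
                    break;
                case "DTL":                      // DTL lines nest under the open ITEM
                    if (currentItem == null) {
                        throw new IllegalStateException("DTL line before any ITM line");
                    }
                    Node detail = new Node("DETAIL");
                    detail.children.addAll(fields);
                    currentItem.children.add(detail);
                    break;
                default:
                    throw new IllegalStateException("Unknown segment tag: " + tag);
            }
        }
        rebuilt.add(header);
        rebuilt.add(items);

        // In-place step: only this DOCUMENT node's child list is replaced.
        document.children.clear();
        document.children.addAll(rebuilt);
    }

    public static void main(String[] args) {
        Node doc = new Node("DOCUMENT", new Node("DELFOR"),
                new Node("LINE", new Node("HDR"), new Node("a")),
                new Node("LINE", new Node("ITM")),
                new Node("LINE", new Node("DTL"), new Node("x")));
        restructure(doc);
        System.out.println(doc.toStringTree());
    }
}
```

With the three sample lines in main, this prints (DOCUMENT DELFOR (HEADER_INFO a) (ITEMS (ITEM (DETAIL x)))). The grouping rules themselves would of course come from the document grammar rather than from a hand-written switch.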


A major reason to go with this approach, I think, is that there are literally hundreds of different types of EDI documents (although I'm only concerned with parsing fewer than ten of these types), but the enveloping structure is exactly the same, regardless of the contained documents.

Also, there are different versions of the EDI standards, and so I would need slightly different parsers for different versions of the same document type. For example, an Edifact DELFOR document may be based on the 96a standard, or the 07b standard, or whatever, with slightly different (sometimes largely different) structures. So, I could have Delfor96aParser, and Delfor07bParser, and the information returned by the envelope grammar would allow me to select which version of EDI is being used, and therefore which document-parser to invoke.
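
The selection step could be as simple as a lookup keyed on document type and standard version. The sketch below is only an illustration under assumptions: the DocumentParser interface and the DocumentParserRegistry class are hypothetical names, and the lambdas stand in for real generated parsers such as Delfor96aParser and Delfor07bParser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical common interface; a real one would expose a parse method
// that takes the DOCUMENT subtree produced by the envelope grammar.
interface DocumentParser {
    String name();
}

class DocumentParserRegistry {
    private final Map<String, DocumentParser> parsers = new HashMap<>();

    void register(String type, String version, DocumentParser parser) {
        parsers.put(key(type, version), parser);
    }

    // Looks up the parser for the type and version reported by the envelope.
    DocumentParser select(String type, String version) {
        DocumentParser parser = parsers.get(key(type, version));
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + type + " " + version);
        }
        return parser;
    }

    // Normalise case so that "delfor"/"96A" and "DELFOR"/"96a" match.
    private static String key(String type, String version) {
        return type.toUpperCase(Locale.ROOT) + "/" + version.toLowerCase(Locale.ROOT);
    }
}
```

Usage would be along the lines of registry.register("DELFOR", "96a", () -> "Delfor96aParser") at start-up, then registry.select(type, version) once the envelope has been parsed.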

So, if I had this envelope parser, which generates a document AST that is a straight list of lines, how would I then process that AST further to add structure to it, using a document-specific parser?

I was looking at ANTLR's 'rewrite' option, but from what I've read it seems to work only with template output. I don't want to output text; I want to rewrite AST to AST, and then process the AST later. No text output is needed.

Would I use a tree-walker? If so, wouldn't this require creating a whole new copy of the document-AST? Is this the right approach? Or is an in-place rewrite more appropriate? If so, how do I do it?

Any help would be appreciated, even if it's just pointing me to the right place to find the info.

Thanks,
Rob
