[antlr-interest] Legal Document Parsing. Can ANTLR help?

Wed Jul 15 18:51:25 PDT 2009

I did this sort of thing a number of years ago, using PCCTS (ANTLR 1).  At the time, I needed to parse the specification document that defined the spacecraft control language for Cassini.  The trick to parsing structured text is to recognize the document structure on top of a formatting language.  In the Cassini case, the formatting language was RTF.  I first wrote an RTF recognizer, then duplicated many of its rules and added document-structure recognition to the duplicates.  Finally, I inserted syntactic predicates such as

header : (structuredHeader)? structuredHeader
       | textHeader
       ;

to distinguish between interesting (structured) features and random text to be ignored.
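
For readers who have not used syntactic predicates, the header rule above can be approximated by hand: try the structured alternative speculatively, rewind, and then commit to whichever alternative applies. The following Python sketch only illustrates that idea. The class and function names (HeaderParser, parse_header) and the "Art." / "Article" followed by a number shape are assumptions made for the example, not the Cassini grammar.

```python
class ParseError(Exception):
    """Raised when an alternative does not match the input."""


class HeaderParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def next_token(self):
        if self.pos >= len(self.tokens):
            raise ParseError("unexpected end of input")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def structured_header(self):
        # Hypothetical structured form: article keyword, then a number.
        keyword = self.next_token()
        if keyword not in ("Art.", "Article"):
            raise ParseError("expected article keyword, got %r" % keyword)
        number = self.next_token()
        if not number.isdigit():
            raise ParseError("expected article number, got %r" % number)
        return ("structured", int(number))

    def text_header(self):
        # Fallback: swallow the rest of the line as uninteresting text.
        words = self.tokens[self.pos:]
        self.pos = len(self.tokens)
        return ("text", " ".join(words))

    def header(self):
        # Equivalent of: (structuredHeader)? structuredHeader | textHeader
        mark = self.pos
        try:
            self.structured_header()  # speculative match only
            matched = True
        except ParseError:
            matched = False
        self.pos = mark  # rewind, as a predicate consumes no input
        if matched:
            return self.structured_header()
        return self.text_header()


def parse_header(line):
    return HeaderParser(line.split()).header()
```

With this sketch, parse_header("Art. 1") takes the structured alternative, while a line such as "Article title" fails the speculative match on its second token and falls back to the text alternative.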

I also had to do a little work to ignore formatting commands that I had no interest in.  I did that with another very small ANTLR grammar that built a symbol table of the formatting commands, recording which commands to skip (using semantic predicates, or sempreds) and which to process.
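
The symbol-table idea can be sketched roughly as follows. This is a simplified Python illustration, not the original grammar: the table is hard-coded rather than built by a second grammar, the choice of which RTF control words to skip or keep is an assumption, and RTF groups and escaped characters are ignored. The should_skip function plays the role of the semantic predicate.

```python
import re

# Hypothetical symbol table: RTF control word -> True if it should be skipped.
COMMANDS = {
    "b": True,     # bold on/off
    "i": True,     # italic on/off
    "fs": True,    # font size
    "par": False,  # paragraph break, kept as a structure hint
    "tab": False,  # tab, kept as a structure hint
}

# A control word: backslash, letters, optional numeric parameter,
# and one optional delimiting space.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def should_skip(name):
    # Stand-in for the semantic predicate; unknown commands are
    # skipped by default (an assumption of this sketch).
    return COMMANDS.get(name, True)


def tokenize(text):
    """Split text into ("text", ...) and ("command", ...) tokens,
    dropping every command the symbol table marks as skippable."""
    tokens = []
    pos = 0
    for m in CONTROL_WORD.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        name = m.group(1)
        if not should_skip(name):
            tokens.append(("command", name))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens
```

The structure-recognition rules then only ever see the commands that matter, which keeps them much smaller than the full formatting grammar.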

This approach worked quite well and should work for your problem.

--Loring

----- Original Message ----
> From: Marco Bagni <m.bagni at marcobagni.com>
> To: antlr-interest at antlr.org
> Sent: Wednesday, July 15, 2009 2:26:13 AM
> Subject: [antlr-interest] Legal Document Parsing. Can ANTLR help?
> 
> 
> Hi,
> 
> I need to perform a syntactic parse of various legal documents
> in order to identify and extract each article and sub-paragraph.
> 
> The documents are text like:
> 
> Act. 123 Bla Bla Bla
> 
> Art. 1
> (Article title)
> 
> Article body with sub-paragraphs (at most three levels of sub-paragraph,
> identified by numbers (1, 2, 3...), letters (a, b, c...) and roman
> numerals (i, ii, iii, vi, etc.))
> 
> Unfortunately, real life is a bit tougher than this: in some
> documents you have the string "Art.", in others "Article"; sometimes the
> article title is present, sometimes not, and so on.
> 
> Do you think that ANTLR can help in generating a parser that identifies
> and extracts the parts of the legal documents, labelling each part with
> its proper place in the hierarchical structure?
> 
> So far I am building a prototype in Perl, but given all the possible
> variations that can be found in the plethora of documents I have to
> "ingest", coding all the exceptions by hand seems quite cumbersome.
> 
> Thanks for your support.
> 
> Regards
> 
> Marco Bagni