[antlr-interest] Legal Document Parsing. Can ANTLR help?

Wed Jul 15 08:37:42 PDT 2009

Marco Bagni wrote:
> Hi,
>
> I need to parse various legal documents syntactically in order to
> identify and extract each article and sub-paragraph.
>
> The documents are text like:
>
> Act. 123 Bla Bla Bla
>
> Art. 1
> (Article title)
>
> Article body with sub-paragraphs (at most three levels of sub-paragraph,
> identified by numbers (1, 2, 3...), letters (a, b, c...) and roman
> numerals (i, ii, iii, iv, etc.))
>
> Unfortunately, real life is a bit tougher than this: in some documents
> you have the string Art., in others Article; sometimes the article
> title is present, sometimes not, and so on.
>
> Do you think that ANTLR can help in generating a parser that identifies
> and extracts the parts of the legal documents, labelling each part with
> its place in the hierarchical structure?
>   
I think a filtering lexer is what you need here, as there is no 
real parsing task. Though to be honest, you might find that awk is 
good enough for this. Look at the FuzzyJava.g example (see the download page).

However:

lexer grammar Articles;
options {filter=true;}

@lexer::members {
 int levelCount;
 String[] levels = new String[10];
 String title;
}

REFERENCE
      : ARTICLE LEVELS TITLE
        {
          System.out.println("Title : " + title);
          // Etc
        }
      ;

fragment ARTICLE
: ('A'|'a') ('R'|'r') ('T'|'t')
     (
         '.'
       | ('i'|'I') ('c'|'C') ('l'|'L') ('e'|'E')
       |
     )
   WS*
    {
         levelCount = 0;
         title = "";
    }

;

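// Helper rules are fragments so that the filter only ever emits
// REFERENCE tokens and never matches a lone letter or a bare line.
fragment LEVELS
   : ( l=LEVEL WS* { levels[levelCount++] = $l.text; } )+ ;

// A single level marker: a number or one lower-case letter.
fragment LEVEL
 : ('0'..'9')+
 | 'a'..'z'
 ;

// Rest of the line. A labelled sub-rule is used because $text in a
// lexer rule is the text of the whole token matched so far.
fragment TITLE : t=REST_OF_LINE { title = $t.text; } ;

fragment REST_OF_LINE : ~('\n'|'\r')* ;

fragment WS : ' ' |  '\t';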

This should get you started. I just typed it in, so check it, of course ;-).
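
If a generated lexer turns out to be more than you need, the same filtering
idea can be tried out in plain Java with java.util.regex before committing
to a grammar. This is only a rough sketch and not ANTLR: the class name
ArticleScanner, the Reference holder and the sample text are made up for
illustration. It assumes that the whole reference sits on one line and that
level markers are separated by whitespace, which may not hold for your
documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArticleScanner {
    /** One matched reference: its level markers and the rest of the line. */
    public static final class Reference {
        public final List<String> levels;
        public final String title;

        Reference(List<String> levels, String title) {
            this.levels = levels;
            this.title = title;
        }
    }

    // "Art", "Art." or "Article" in any case, then one or more
    // whitespace-separated level markers (a number or a single lower-case
    // letter), then whatever is left on the line as the title.
    private static final Pattern REFERENCE = Pattern.compile(
        "\\b(?i:art(?:icle|\\.)?)[ \\t]*"
        + "((?:(?:\\d+|[a-z])(?:[ \\t]+|$))+)"
        + "(.*)");

    /** Scans the text line by line and ignores every line that does not match. */
    public static List<Reference> scan(String text) {
        List<Reference> found = new ArrayList<>();
        for (String line : text.split("\\R")) {
            Matcher m = REFERENCE.matcher(line);
            if (m.find()) {
                List<String> levels =
                    Arrays.asList(m.group(1).trim().split("[ \\t]+"));
                found.add(new Reference(levels, m.group(2).trim()));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, loosely modelled on the question.
        String sample = "Act. 123 Bla Bla Bla\n\n"
            + "Art. 1 a Definitions\n"
            + "some body text\n"
            + "Article 12\n";
        for (Reference r : scan(sample)) {
            System.out.println("Levels: " + r.levels + "  Title: " + r.title);
        }
    }
}
```

For the sample in main, this should report levels [1, a] with the title
Definitions for the first reference, and level [12] with an empty title for
the second; all other lines are skipped, much as a filtering lexer skips
input it cannot match.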

Jim