[antlr-interest] please help ... I need to parse a paper ...

Thu Mar 16 20:30:04 PST 2006

I did something like this a few years back with PCCTS (ANTLR 1).  As Martin
points out, English keywords are not enough to handle the general case,
since the same words may appear in the body of the text.
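
To illustrate the problem, here is a minimal Python sketch (not ANTLR, and
not code from the original project) of a keyword-only split.  The function
name and the sample paper are made up for the example.

```python
import re

def split_at_keyword(text, keyword="abstract"):
    """Naively split text at the first case-insensitive match of keyword."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.end():]

# Hypothetical paper whose title happens to contain the word "Abstract".
paper = (
    "An Abstract Machine for Parsing\n"
    "A. Author\n"
    "Abstract\n"
    "We describe a machine...\n"
)

front, rest = split_at_keyword(paper)
print(repr(front))  # 'An ' : the split lands inside the title
```

The split happens inside the title rather than at the section heading,
which is why the keyword alone is not a reliable delimiter.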

What helps is to take advantage of formatting information in the document.
I used RTF (Microsoft's rich text formatting), but HTML or any other
formatting language would work as well.  Only a subset of the formatting
keywords need be recognized:  just those that would be coupled with a
section title such as "Abstract".  Other keywords can be ignored in the
lexer.
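
A rough sketch of that idea in Python, assuming (purely for illustration)
that section titles are set in bold.  The set HEADING_CONTROLS, the token
names, and the sample input are all hypothetical.  The tokenizer is far
from a full RTF reader:  it ignores escapes such as \'e9 and treats a
closing brace as the end of bold.

```python
import re

# Control words that may mark a title; this choice is an assumption.
HEADING_CONTROLS = {"b"}

TOKEN_RE = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"  # control word, optional number, optional space
    r"|([{}])"               # group delimiters
    r"|([^\\{}]+)"           # run of plain text
)

def tokenize(rtf):
    """Yield (kind, value) pairs; uninteresting control words are dropped."""
    for m in TOKEN_RE.finditer(rtf):
        word, param, brace, text = m.groups()
        if word is not None:
            if word in HEADING_CONTROLS:
                # "\b" turns bold on, "\b0" turns it off
                yield ("BOLD_OFF" if param == "0" else "BOLD_ON", word)
            # any other control word is ignored by the lexer
        elif brace is not None:
            yield ("GROUP", brace)
        elif text.strip():
            yield ("TEXT", text.strip())

def headings(tokens, titles=("Abstract", "Introduction")):
    """Yield known titles that appear while bold is switched on."""
    bold = False
    for kind, value in tokens:
        if kind == "BOLD_ON":
            bold = True
        elif kind == "BOLD_OFF" or (kind == "GROUP" and value == "}"):
            bold = False  # simplification: a closing brace ends bold
        elif kind == "TEXT" and bold and value in titles:
            yield value

sample = r"{\rtf1\ansi {\b Abstract}\par This abstract is short.\par}"
tokens = list(tokenize(sample))
print(list(headings(tokens)))  # ['Abstract']
```

Only the bold "Abstract" is reported as a title; the word "abstract" in
the body text comes through as ordinary text.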

You end up with two versions of many rules:  one version recognizes the
textual keywords like "Abstract" or "Title" and a second recognizes
generic text.  These are distinguished with syntactic predicates (ANTLR 3
may let you get by without them).  I also used syntactic predicates to
recognize textual keywords; predicate hoisting was essential.  RTF
keywords were handled using a machine-generated symbol table.  I generated
it from a list of keywords that took arguments or had a specific type,
like "header", and so needed their token type values changed in the lexer.
The symbol table was used inside semantic predicates and subsequent
actions.
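
The general shape can be sketched in Python.  This is an illustration
only, not the PCCTS grammar:  the token type names, the contents of
SYMBOLS, and the token stream are hypothetical, and is_title() merely
plays the part that a syntactic predicate plays in the grammar.

```python
# Hypothetical token types; the names are illustrative only.
CONTROL, CONTROL_WITH_ARG, DESTINATION = "CONTROL", "CONTROL_WITH_ARG", "DESTINATION"

# Stand-in for the machine-generated table: RTF keyword -> token type.
SYMBOLS = {
    "fs": CONTROL_WITH_ARG,  # font size, takes a numeric argument
    "header": DESTINATION,   # introduces a page-header group
}

def token_type(keyword):
    """Look the keyword up; anything unlisted stays a generic CONTROL."""
    return SYMBOLS.get(keyword, CONTROL)

SECTION_TITLES = {"abstract", "introduction", "references"}

def is_title(tokens, i):
    """Pure lookahead: BOLD_ON, a known title as TEXT, then BOLD_OFF."""
    return (
        i + 2 < len(tokens)
        and tokens[i][0] == "BOLD_ON"
        and tokens[i + 1][0] == "TEXT"
        and tokens[i + 1][1].lower() in SECTION_TITLES
        and tokens[i + 2][0] == "BOLD_OFF"
    )

def parse(tokens):
    """Collect text per section, trying the title rule before generic text."""
    sections = {"front": []}
    current = "front"
    i = 0
    while i < len(tokens):
        if is_title(tokens, i):          # "title" version of the rule
            current = tokens[i + 1][1].lower()
            sections[current] = []
            i += 3
        else:                            # "generic text" version
            kind, value = tokens[i]
            if kind == "TEXT":
                sections[current].append(value)
            i += 1
    return sections

tokens = [
    ("TEXT", "A Paper Title"),
    ("BOLD_ON", ""), ("TEXT", "Abstract"), ("BOLD_OFF", ""),
    ("TEXT", "This abstract is short."),
    ("BOLD_ON", ""), ("TEXT", "Introduction"), ("BOLD_OFF", ""),
    ("TEXT", "Parsing is fun."),
]
sections = parse(tokens)
print(sections["front"])     # ['A Paper Title']
print(sections["abstract"])  # ['This abstract is short.']
```

Everything before the first recognized title ends up under "front", and
the text between "Abstract" and "Introduction" ends up under "abstract".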

This is a problem whose solution depends heavily on predicated parsing;
solving it by hand would be a gruesome experience.  With ANTLR, it was not
that difficult, although I did run into annoying places where the RTF that I
was parsing deviated from the Microsoft spec.

--Loring

  _____  

From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of
enriquebris at cimex.com.cu
Sent: Thursday, March 16, 2006 11:39 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] please help ... I need to parse a paper ...

Hello,

I'm trying to parse a paper. The document format is more or less like the
following:

Title

Authors

Abstract

Keywords

Introduction

.

References

First of all, I'm trying to get the text before ABSTRACT. I have a lexer
rule -->  ABSTRACT_WORD : "abstract"; and in the parser --> abstract :
(~ABSTRACT_WORD) but it doesn't work as I expect. The thing is: I want to
obtain all the text before the word ABSTRACT (ABSTRACT_WORD token), and the
text between ABSTRACT_WORD (token) and INTRODUCTION_WORD (I also have an
INTRODUCTION_WORD declared in the lexer). Please, can anybody suggest an
idea to solve this?

Enrique
