[antlr-interest] please help ... I need to parse a paper ...

Martin Probst mail at martin-probst.com
Thu Mar 16 14:44:41 PST 2006


The best way to think about this is probably to ask what happens if you
get input like this:

Abstract
Hello World, the 
Introduction contains interesting stuff.
Introduction
As mentioned in the
Abstract this is very interesting

To get around this you will either have to impose rules on the authors
(e.g. the word "Abstract" may not appear in the text, or at least not at
the start of a line), or you will have to lex context-dependently, e.g.
after the first "Abstract" token, "Abstract" can no longer occur as a
keyword. That still doesn't get you around the problem of the word
"Introduction" appearing within the abstract. Or you could require the
whole text of each section (Abstract, Introduction, whatever) to be on
one line ...
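
To make the context-dependent idea concrete, here is a rough sketch in
Python (not ANTLR) of a lexer that treats a keyword as a heading only the
first time it sees it. The names HEADINGS and lex are made up for
illustration, and whitespace-separated words are an assumption about the
input. Run on the example above, it shows exactly the remaining problem:
the "Introduction" inside the abstract is taken as the heading, and the
real heading is demoted to an ordinary word.

```python
import re

# Assumed section keywords; hypothetical, adjust to the real paper format.
HEADINGS = ("Abstract", "Introduction")


def lex(text):
    """Return (kind, value) tokens; a keyword is a HEADING only on first sight."""
    seen = set()
    tokens = []
    for word in re.findall(r"\S+", text):
        if word in HEADINGS and word not in seen:
            seen.add(word)
            tokens.append(("HEADING", word))
        else:
            # Any later occurrence of a keyword is lexed as plain text.
            tokens.append(("WORD", word))
    return tokens


sample = """Abstract
Hello World, the
Introduction contains interesting stuff.
Introduction
As mentioned in the
Abstract this is very interesting
"""

tokens = lex(sample)
# Positions of the tokens the lexer believes are headings.
headings = [(i, value) for i, (kind, value) in enumerate(tokens) if kind == "HEADING"]
print(headings)
```

The second "Abstract" is correctly demoted to a word, but the heading
position reported for "Introduction" is the one in the middle of the
abstract text.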

Anyway, ANTLR doesn't provide built-in support for line-based parsing;
you'd have to hack that into actions yourself. ANTLR (and similar tools)
is best at token-based languages like programming languages, where
keywords (like "Abstract" in your case) always mean the same thing
independent of their position in the text.

You might find a hand-built solution easier than using a parser
generator like ANTLR. What about splitting the text into lines, checking
whether a line is exactly "Abstract", and then taking all subsequent
lines until one is exactly "Introduction" (or whatever)?
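
A minimal sketch of that hand-built approach, in Python. The function
name split_sections and the fixed list of headings are assumptions for
illustration; it also assumes a heading is a line containing nothing but
the keyword, and that each heading occurs only once.

```python
def split_sections(text, headings=("Abstract", "Introduction")):
    """Split text into sections at lines that consist solely of a heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in headings and stripped not in sections:
            # A line that is exactly a not-yet-seen heading opens a new section.
            current = stripped
            sections[current] = []
        elif current is not None:
            # Everything else belongs to the section opened last.
            sections[current].append(line.rstrip())
    return {name: "\n".join(body).strip() for name, body in sections.items()}


paper = """Abstract
Hello World, the
Introduction contains interesting stuff.
Introduction
As mentioned in the
Abstract this is very interesting
"""

result = split_sections(paper)
print(result["Abstract"])
print(result["Introduction"])
```

Because only whole lines are compared, "Introduction contains
interesting stuff." stays inside the abstract. It would still break if
an author put a keyword alone on a line inside the running text.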

HTH,
Martin



