[antlr-interest] Natural language parsing

Stuart Watt SWatt at infobal.com
Tue Jan 8 14:44:07 PST 2008


[On with my cognitive scientist hat!]

Grammar less important than frequency may be. Start with verbs, sentences
may.

(Yes, I know it ends up sounding like Yoda.) Human language understanding
research no longer follows the lexical -> grammatical -> semantic pipeline
in all cases. A lot depends on the purpose of the application. Information
retrieval typically ignores all grammatical structure in favour of word
frequencies, because that supports robust search, and if you have a decent
query (and a large collection) to start with, it's enough. If you need to
map semantics, or are working with small collections, these techniques are
far less useful. Frequencies are handy, but essentially they are heuristics
for how likely a word is to bear information. However, people may be cued
to process words "out of band" by other tricks, like SWITCHING TO UPPER
CASE (which you probably read before the other words without intending to),
even though this may not fit grammatically.
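
To make the contrast concrete, here is a minimal bag-of-words retrieval
sketch in Python (my illustration, not any particular IR system):
documents are ranked purely by how often the query terms occur in them,
and grammatical structure is ignored entirely.

    from collections import Counter

    def score(query, document):
        # Term frequencies stand in for "likelihood of information
        # bearing": no grammar, just counts (Counter returns 0 for
        # absent words).
        freqs = Counter(document.lower().split())
        return sum(freqs[term] for term in query.lower().split())

    docs = ["the parser builds a syntax tree from the sentence",
            "word frequencies support robust search over large collections"]
    query = "robust word search"
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    print(ranked[0])  # the frequency-based match wins, grammar ignored

Real systems weight terms by inverse document frequency and normalise for
document length, but the principle is the same.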

I've been working with an eye-tracker to explore how people read texts, and
they generally don't follow a wholly linear pattern. Some words are skipped
entirely, and there is a tendency to latch onto significant words, and even
to step back and reprocess others in a new context, if time and resources
permit. Hence, many people entirely miss grammatical errors, and are
unbelievably robust in the face of language errors of all kinds. (Robustness
is why IR leans on word frequencies so much - it is a very reliable approach
in the face of errors.) The key is to think of the reader as an information
processor trying to do a particular task. The nature of the task, how much
time they have, the source of the text, and so on all influence the
strategies they use.

One interesting model is the predictor-substantiator style, developed by De
Jong in FRUMP. FRUMP consists of two components: a predictor (designed to
make guesses about what the text is saying) and a substantiator (which looks
for evidence for those guesses). The two operate cyclically, with lots of
backtracking. This is kind of like a *very* general parser, except that it
was originally intended to construct a semantic model directly, not a
syntactic one, and it can move to more or less any token at any point in
processing. Most standard parsers and grammars are like easy-to-construct
versions of this framework. Easy to construct is good: most
predictor-substantiator NLP systems are mammoth efforts in fairly limited
domains. Yet they achieve a balance between the robustness of IR and the
depth of processing of a grammar.
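
A toy version of that predict/substantiate cycle, in Python. The Frame
and Slot classes and the two "scripts" below are invented for
illustration, and bear no resemblance to the scale of the real FRUMP:

    class Slot:
        def __init__(self, words):
            self.words = set(words)
        def matches(self, token):
            return token in self.words

    class Frame:
        def __init__(self, name, slots):
            self.name, self.slots = name, slots
        def substantiate(self, tokens):
            # Seek evidence for every slot anywhere in the token
            # stream: unlike a left-to-right parser, any token may be
            # inspected at any point in processing.
            return all(any(s.matches(t) for t in tokens)
                       for s in self.slots)

    SCRIPTS = [  # the predictor proposes these, most specific first
        Frame("demotion", [Slot({"fired", "resigned"}),
                           Slot({"minister", "ceo"})]),
        Frame("meeting",  [Slot({"met", "talks"}),
                           Slot({"leaders", "envoy"})]),
    ]

    def understand(tokens):
        for frame in SCRIPTS:               # predict
            if frame.substantiate(tokens):  # substantiate
                return frame.name           # a semantic model, not a parse tree
        return None                         # every guess failed

    print(understand("the minister resigned yesterday".split()))
    # -> demotion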

The great thing about a grammar is that (especially in combination with a
part-of-speech tagger for term classification) it can get you a long way
quickly, particularly with backtracking. Certainly, if I needed to build an
NLP system to extract some sort of meaning from texts in a limited domain on
a time budget, I'd start with some kind of grammar, even if it isn't
necessarily the right thing conceptually. However, the more robustness I
needed to achieve, the more I'd have to bend its rules, and there might come
a point where I ended up with something that didn't look much like a grammar
in any traditional sense.
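
As a sketch of that division of labour - a tiny hand-rolled tagger plus
a single subject-verb-object rule, both invented here for illustration:

    # A three-word lexicon plays the part-of-speech tagger; unknown
    # words default to NOUN, which is where the rule-bending starts.
    LEXICON = {"woods": "NOUN", "eyes": "VERB", "masters": "NOUN"}

    def tag(tokens):
        return [(t, LEXICON.get(t.lower(), "NOUN")) for t in tokens]

    def extract(tokens):
        tagged = tag(tokens)
        # One grammar rule: a sentence is NOUN VERB NOUN.
        if [pos for _, pos in tagged] == ["NOUN", "VERB", "NOUN"]:
            subj, verb, obj = (word for word, _ in tagged)
            return {"actor": subj, "action": verb, "target": obj}
        return None  # robustness ends where the grammar does

    print(extract("Woods eyes Masters".split()))
    # -> {'actor': 'Woods', 'action': 'eyes', 'target': 'Masters'}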

All the best
Stuart

-----Original Message-----
From: Andy Tripp [mailto:antlr at jazillian.com]
Sent: Tuesday, January 08, 2008 4:30 PM
To: Terence Parr
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Natural language parsing


Terence Parr wrote:
>
> ANTLR could only handle a limited deterministic subset rather than 
> full NLP and couldn't help in that area.  I'm just saying that 
> grammatical structure is key to NLP.  Word freq don't cut it.  I'm 
> paraphrasing Steven Pinker, a human language expert from some 
> fancy-pants school back east. :)
My understanding is that the grammatical structure and word frequencies 
are all intertwined, too. So when you look for the verb in "Woods Eyes 
Masters", you might see that "eyes" is used as a verb less often than 
"masters" is, yet that's offset by the fact that sentences almost never 
end with a verb. And even then, if "Woods" doesn't turn out to be a noun 
which can perform the "eyes" action (as determined by word frequency), 
then we might backtrack and decide that "masters" is the verb after all.
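
A rough Python sketch of that disambiguation - the verb frequencies and
the end-of-sentence penalty below are made up for illustration:

    VERB_FREQ = {"eyes": 0.2, "masters": 0.4}  # invented P(word is a verb)
    END_VERB_PENALTY = 0.9                     # sentences rarely end in a verb

    def readings(tokens):
        scored = []
        for i, word in enumerate(tokens):
            p = VERB_FREQ.get(word.lower(), 0.0)
            if i == len(tokens) - 1:
                p *= 1 - END_VERB_PENALTY  # offsets "masters" despite its frequency
            scored.append((p, i))
        # Try the most likely verb first; keep the rest for backtracking.
        return sorted(scored, reverse=True)

    def parse(tokens, can_perform):
        for p, i in readings(tokens):
            subject = tokens[:i]
            if subject and can_perform(subject[-1], tokens[i]):
                return {"subject": subject, "verb": tokens[i],
                        "object": tokens[i + 1:]}
        return None  # every reading rejected

    # "Woods" can plausibly perform "eyes", so no backtracking is needed.
    print(parse("Woods eyes Masters".split(), lambda noun, verb: True))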

All in all, NLP seems like a total crapshoot compared to parsing 
programming languages. Heck, even C++ and COBOL have SOME rules that 
come close :)

Andy

