[antlr-interest] trying to parse retrievalware search queries
Gerald Halstead
standard.jack at yahoo.com
Fri Aug 24 11:57:02 PDT 2007
I am trying to translate queries written for the commercial retrievalware search engine
into queries for the open source Lucene search engine.
The best retrievalware documentation I can find online is at:
http://www.ant.kiev.ua:7273/Work8/helppages/ru/Boolean.htm
and
http://www.ant.kiev.ua:7273/Work8/helppages/ru/Wildcards.htm
which are summarized below:
Operator  Syntax                          Precedence  Description
--------  ------                          ----------  -----------
()        (word1 or word2) and word3      1           Override precedence of other operators.
                                                      Can be nested to any depth.
not       not word                        2           word must not be in doc.
^         ^word
and       word1 and word2                 3           Both word1 and word2 must be in doc.
&         word1 & word2                               This is the default operator.
but       word1 but word2
within    word1 word2 within N            4           word1 must be within N words of word2.
adj       word1 word2 adj N               4           Same as "within" except word1 precedes word2.
between   word1 between word2 and word3   4           word1 between word2 and word3
inside    word1 inside word2 and word3    4           word1 between word2 and word3
or        word1 or word2                  5           Either word1 or word2 must be in doc.
|         word1 | word2                   5           Either word1 or word2 must be in doc.
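To make the precedence column concrete, here is a rough hand-written sketch in Java (not ANTLR) that covers only the boolean operators: parentheses bind tightest, then not, then and/&/but, then or/|. The class and method names are made up for illustration. It assumes every operator and parenthesis can be found by splitting on whitespace, so it does not handle the attached ^word form, the implicit default "and", or within/adj/between. The output uses Lucene's AND, OR and NOT operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of retrievalware or Lucene.
class PrecedenceSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    PrecedenceSketch(String query) {
        // Pad parentheses with spaces so a plain whitespace split is enough.
        String padded = query.replace("(", " ( ").replace(")", " ) ").trim();
        for (String t : padded.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // Consumes the next token if it equals one of ops in all lower or all upper case.
    private boolean accept(String... ops) {
        String t = peek();
        if (t == null) return false;
        for (String op : ops) {
            if (t.equals(op) || t.equals(op.toUpperCase())) {
                pos++;
                return true;
            }
        }
        return false;
    }

    // Precedence 5 (binds loosest): or, |
    String parseOr() {
        String left = parseAnd();
        while (accept("or", "|")) {
            left = "(" + left + " OR " + parseAnd() + ")";
        }
        return left;
    }

    // Precedence 3: and, &, but (explicit operators only in this sketch)
    String parseAnd() {
        String left = parseNot();
        while (accept("and", "&", "but")) {
            left = "(" + left + " AND " + parseNot() + ")";
        }
        return left;
    }

    // Precedence 2: not, ^ (written as a separate token)
    String parseNot() {
        if (accept("not", "^")) return "NOT " + parseNot();
        return parsePrimary();
    }

    // Precedence 1: parentheses, otherwise a single word
    String parsePrimary() {
        if (accept("(")) {
            String inner = parseOr();
            if (!accept(")")) throw new IllegalArgumentException("missing )");
            return inner;
        }
        String t = peek();
        if (t == null) throw new IllegalArgumentException("unexpected end of query");
        pos++;
        return t;
    }

    static String translate(String query) {
        PrecedenceSketch p = new PrecedenceSketch(query);
        String out = p.parseOr();
        if (p.peek() != null) {
            throw new IllegalArgumentException("unexpected token: " + p.peek());
        }
        return out;
    }
}
```

Each precedence level gets its own method, and a looser level calls the next tighter one. An ANTLR grammar would presumably express the same layering as one rule per level.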
Apparently the within syntax allows for an "AND" as follows:
network AND security WITHIN 1
(general electric WITHIN 3) AND (westinghouse electric WITHIN 3) WITHIN 40
Operators may be all lower or all upper case.
An exact phrase is enclosed in double quotes.
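For the simple two-word case, Lucene's proximity syntax ("word1 word2"~N) looks like the closest target. Below is a minimal sketch in Java, with a hypothetical class name, that rewrites only the flat forms "word1 word2 within N" and "word1 and word2 within N" (and the same with adj). It assumes single-word operands, so nested queries like the second example above are out of scope. It is also only an approximation: Lucene's slop is not exactly a word window, and the ordering that adj requires is not enforced.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of retrievalware or Lucene.
class ProximitySketch {
    // word1 [and] word2 within|adj N, operators all lower or all upper case.
    private static final Pattern PROXIMITY = Pattern.compile(
        "^(\\S+)\\s+(?:(?:and|AND)\\s+)?(\\S+)\\s+(?:within|WITHIN|adj|ADJ)\\s+(\\d+)$");

    // Returns a Lucene proximity query, or null if the input is not a flat proximity form.
    static String toLucene(String query) {
        Matcher m = PROXIMITY.matcher(query.trim());
        if (!m.matches()) return null;
        return "\"" + m.group(1) + " " + m.group(2) + "\"~" + m.group(3);
    }
}
```

With this sketch, the input network AND security WITHIN 1 becomes "network security"~1.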
Words may contain the following wildcards:
Wildcard  Description                               Example
--------  -----------                               -------
*         Match anything or nothing.                pharma*
?         Match exactly one character.              la?er
_         Match one or no character.                colo_r
@         Match exactly one alphabetic character.   c@er
#         Match exactly one numeric character.      #600
\         Wildcard escape.                          joe\@home
^         Match any character except the next one.  199[^7]
[ ]       Match one character from set.             [aeiou]
                                                    A[1-5]
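Lucene's query syntax only has the * and ? wildcards, so the others need some other treatment. As a way of pinning down what each wildcard in the table means, here is a rough Java sketch (hypothetical names) that turns a retrievalware term into a java.util.regex pattern. It assumes "alphabetic" means ASCII letters and "numeric" means ASCII digits, and it copies [ ] sets through unchanged because the regex character-class syntax reads the table's examples the same way.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of retrievalware or Lucene.
class WildcardSketch {
    static String toRegex(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '*': sb.append(".*"); break;        // anything or nothing
                case '?': sb.append('.'); break;         // exactly one character
                case '_': sb.append(".?"); break;        // one or no character
                case '@': sb.append("[A-Za-z]"); break;  // one letter (ASCII assumed)
                case '#': sb.append("[0-9]"); break;     // one digit (ASCII assumed)
                case '\\':
                    // Wildcard escape: take the next character literally.
                    if (i + 1 < term.length()) {
                        sb.append(Pattern.quote(String.valueOf(term.charAt(++i))));
                    }
                    break;
                case '[': {
                    int close = term.indexOf(']', i);
                    if (close < 0) {
                        sb.append("\\[");                // unterminated set: literal '['
                    } else {
                        sb.append(term, i, close + 1);   // copy [aeiou], [^7], [1-5] as-is
                        i = close;
                    }
                    break;
                }
                default:
                    sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    // True if the whole word matches the wildcard term.
    static boolean matches(String wildcard, String word) {
        return Pattern.matches(toRegex(wildcard), word);
    }
}
```

For example, matches("colo_r", "colour") and matches("colo_r", "color") both hold, while matches("199[^7]", "1997") does not.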
Lucene query syntax:
http://lucene.apache.org/java/docs/queryparsersyntax.html
I realize that a precise translation is impossible.
I'm not a grammar meister and could use some help creating an ANTLR
grammar for retrievalware. Here's the dysfunctional grammar
I've created so far:
grammar Rware;

@header {
package test;
import java.util.HashMap;
}

@lexer::header {package test;}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

query
    :   orExpression
    ;

orExpression
    :   hackExpression (('or' | 'OR' | '|') hackExpression)*
    ;

hackExpression
    :   withinExpression
    |   adjExpression
    |   betweenExpression
    |   andExpression
    ;

withinExpression
    :   primary ('and' | 'AND')? primary ('within' | 'WITHIN') INT
    ;

adjExpression
    :   andExpression andExpression ('adj' | 'ADJ') INT
    ;

betweenExpression
    :   andExpression ('between' | 'BETWEEN' | 'inside' | 'INSIDE') andExpression ('and' | 'AND')
        andExpression
    ;

andExpression
    :   unaryExpression (('and' | 'AND' | 'but' | 'BUT' | '&')? unaryExpression)*
    ;

unaryExpression
    :   ('not' | 'NOT' | '^') unaryExpression
    |   primary
    ;

parExpression
    :   '(' orExpression ')'
    ;

primary
    :   parExpression
    |   WORD
    |   STRING_LITERAL
    ;

WORD
    :   LETTER+
    ;

STRING_LITERAL
    :   '"' ( EscapeSequence | ~('\\'|'"') )* '"'
    ;

fragment
EscapeSequence
    :   '\\' ('!'|'~'|'?'|'*'|'_'|'@'|'#'|'^'|'\"'|'\''|'\\')
    ;

fragment
LETTER
    :   '$'
    |   'A'..'Z'
    |   'a'..'z'
    |   '0'..'9'
    |   '_'
    ;

INT
    :   ('0' | '1'..'9' '0'..'9'*)
    ;

WS
    :   (' '|'\r'|'\t'|'\u000C'|'\n') { skip(); }
    ;