[antlr-interest] trying to parse retrievalware search queries

Gerald Halstead standard.jack at yahoo.com
Fri Aug 24 11:57:02 PDT 2007


I trying to translate queries written for the commercial retrievalware search engine
into the queries for the open source Lucene search engine.

The best retrievalware documentation I can find is online is at:

    http://www.ant.kiev.ua:7273/Work8/helppages/ru/Boolean.htm

and

    http://www.ant.kiev.ua:7273/Work8/helppages/ru/Wildcards.htm

which are summarized below:



Operator      Syntax                         Precedence  Description
--------      ------                         ----------  -----------

()            (word1 or word2) and word3     1           Override precedence of other operators.
                                                         Can be nested to any depth.
                                                         
not           not word                       2           word must not be in doc.             
^             ^word

and           word1 and word2                3           Both word1 and word2 must be in doc.     
  
&             word1 & word2                              This is the default operator.
but           word1 but word2

within        word1 word2 within N           4           word1 must be within N words of word2.

adj           word1 word2 adj N              4           same as "within" except word1 precedes
word2.

between       word1 between word2 and word3  4           word1 between word2 and word3
inside        word1 insdieword2 and word3    4           word1 between word2 and word3

or            word1 or word2                 5           Either word1 or word2 must be in doc.
|             word1 | word2                  5           Either word1 or word2 must be in doc.


Apparently the within syntax allows for an "AND" as follows:

    network AND security WITHIN 1

    (general electric WITHIN 3) AND (westinghouse electric WITHIN 3) WITHIN 40

Operators may be all lower or all upper case.
And exact phrase is enclosed in double quotes.

Words may contain the following wildcards:

Wildcard Description                              Example
-------- -----------                              -------

*        Match anything or nothing.               pharma*

?        Match exactly one character.             la?er

_        Match one or no character.               colo_r

@        Match exactly one alphabetic character.  c at er

#        Match exactly one numeric character.     #600

\        Wildcard escape.                         joe\@home

^        Match any character except the next one. 199[^7]

[ ]      Match one character from set.            [aeiou]
                                                  A[1-5]



Lucene query syntax:

    http://lucene.apache.org/java/docs/queryparsersyntax.html


I realize that a precise translation is impossible.

I'm not a grammar meister and could use some help creating a antlr
grammar for retrievalware.  Here's the dysfunctional grammar
I've created so far:





grammar Rware;

@header {
package test;
import java.util.HashMap;
}

@lexer::header {package test;}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

query
	: orExpression
	;
	
orExpression
	: hackExpression (('or' | 'OR' | '|') hackExpression)*
	;
	
hackExpression
	: withinExpression
	| adjExpression
	| betweenExpression
	| andExpression
	;

withinExpression
	: primary ('and' | 'AND')? primary ('within' | 'WITHIN') INT
	;
	
adjExpression
	: andExpression andExpression ('adj' | 'ADJ') INT
	;
	
betweenExpression
	: andExpression ('between' | 'BETWEEN' | 'inside' | 'INSIDE') andExpression  ('and' | 'AND')
andExpression
	;

andExpression
	: unaryExpression (('and' | 'AND' | 'but' | 'BUT' | '&')? unaryExpression)*
	;
	
unaryExpression
	: ('not' | 'NOT' | '^') unaryExpression
	| primary
	;
	
parExpression
	: '(' orExpression ')'
	;	
	
primary
	: parExpression
	| WORD
	| STRING_LITERAL
	;

WORD
	: LETTER+
	;

STRING_LITERAL
	:  '"' ( EscapeSequence | ~('\\'|'"') )* '"'
	;
    
fragment
EscapeSequence
	:   '\\' ('!'|'~'|'?'|'*'|'_'|'@'|'#'|'^'|'\"'|'\''|'\\')
	;    

fragment
LETTER
	:	'$'
	|	'A'..'Z'
	|	'a'..'z'
	|       '0'..'9'
	|	'_'
	;
    
INT
	: ('0' | '1'..'9' '0'..'9'*)
	;

WS
	:  (' '|'\r'|'\t'|'\u000C'|'\n') { skip(); }
	;




       
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/


More information about the antlr-interest mailing list