[antlr-interest] Parsing quoted phrases and non-quoted keywords
Scott Van Wart
scott at indosoft.com
Fri Jul 31 09:31:55 PDT 2009
This is the first grammar I've written from scratch, and I'm having some
difficulty picking up quoted strings and non-quoted keywords. Here's
a sample input:
Google[+Keyword1 Keyword2 -Keyword3] Yahoo["Phrase 1" +"Phrase 2"
-"Phrase 3"] Keyword4 "Phrase 4"
Hypothetically, this could be used for a multi-search-engine search
(don't get caught up on that, though, because it's not at all
what I'm using it for :).
I'm having two issues with it: one is functional, the other is
more about grammar file maintenance. (3) is just a bonus question.
I'm hoping these are newbie questions, because if they're not I'm
probably in over my head and need to do some more reading:
1) When ANTLR gives me the quoted string, I lose the whitespace
inside it, which is significant for me only within a quoted
string. "foo bar" becomes <">, <foo>, <bar> and <">. So if I'm
searching, say, a database, and the amount of whitespace in a column is
significant (a silly idea, admittedly), then I'm out of luck.
2) In order to also catch non-quoted expressions, I have to list every
terminal token in a catch-all rule, inside a set negated with the ~
operator. So if I later add a token to the grammar I have to remember to
add it to this catch-all rule as well.
3) Can I safely add the $channel=HIDDEN action to the DOUBLE_QUOTE token
so I don't see those tokens at all?
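To make the behaviour I'm after in (1) and (2) concrete, here is a minimal Python sketch of the token stream I'd like to end up with. It is not ANTLR and not a proposed fix for the grammar; the names SPECIAL, QUOTE, TOKEN_RE and tokenize are made up for illustration. It assumes a quoted phrase should come back as one token with its inner whitespace intact, and that the set of "special" characters is declared in exactly one place.

```python
import re

# Special single-character tokens, mirroring the tokens { ... } section
# of the grammar. Assumption: adding an entry here should be the only
# change needed for the unquoted-keyword pattern to exclude it.
SPECIAL = {
    "INCLUSION": "+",
    "EXCLUSION": "-",
    "LEFT_SQB": "[",
    "RIGHT_SQB": "]",
}
QUOTE = '"'


def build_pattern():
    # Characters an unquoted keyword may not contain, derived from SPECIAL.
    excluded = re.escape("".join(SPECIAL.values()) + QUOTE)
    parts = [("QUOTED", r'"[^"]*"')]  # whole phrase, inner whitespace kept
    parts += [(name, re.escape(ch)) for name, ch in SPECIAL.items()]
    parts += [("WS", r"\s+"), ("NQUOTED", r"[^\s" + excluded + r"]+")]
    return re.compile("|".join("(?P<%s>%s)" % p for p in parts))


TOKEN_RE = build_pattern()


def tokenize(text):
    """Return (token_name, text) pairs, dropping whitespace between tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # For example, an opening quote with no closing quote.
            raise ValueError(
                "unexpected character %r at offset %d" % (text[pos], pos)
            )
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for tok in tokenize('Yahoo["Phrase  1" +"Phrase 2"] Keyword4'):
        print(tok)
```

The point of the sketch is only that whitespace between tokens is dropped while whitespace inside the quotes survives, and that the unquoted pattern is derived from the token table rather than repeated by hand.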
Here's my grammar, followed by a few more comments:
grammar search;
/* Google[+Keyword1 Keyword2 -Keyword3] Yahoo["Phrase 1" +"Phrase 2"
-"Phrase 3"] Keyword4 "Phrase 4" */
tokens {
INCLUSION='+';
EXCLUSION='-';
DOUBLE_QUOTE='"';
LEFT_SQB='[';
RIGHT_SQB=']';
}
advanced_search : ( attribute_search | keyword_search )+ ;
attribute_label : quoted_value | NQUOTED_VALUE ;
WS : ( ' ' | '\r' | '\t' | '\u000C' | '\n' )+ { $channel=HIDDEN; } ;
/* I took the above from
http://www.antlr.org/wiki/display/ANTLR3/Grammars in the "Lexer Rules"
subsection. */
attribute_search : attribute_label LEFT_SQB attribute_value_spec+
RIGHT_SQB ;
keyword_search : quoted_value | NQUOTED_VALUE ;
attribute_value_spec : ( inclusion_exclusion? attribute_value ) ;
inclusion_exclusion : ( INCLUSION | EXCLUSION ) ;
attribute_value : quoted_value | NQUOTED_VALUE ;
quoted_value : DOUBLE_QUOTE ( options {greedy=false;} : . )*
DOUBLE_QUOTE ;
/* The above is based on the comment lexer rule section from the
examples in the "EBNFs" section in
http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required
*/
NQUOTED_VALUE : ~( INCLUSION | EXCLUSION | DOUBLE_QUOTE | LEFT_SQB
| RIGHT_SQB | ' ' | '\r' | '\t' | '\u000C' | '\n' )* ;
I've tried a number of different things for (2), such as using .* as the
body of NQUOTED_VALUE, but I get the unmatchable alternative errors (which I
still haven't wrapped my head around). I'm also a little scared to
touch the quoted_value rule definition to try to fix (1), because it's
working pretty well now if I ignore the issue :).
Thanks,
Scott