[antlr-interest] Parsing quoted phrases and non-quoted keywords

Fri Jul 31 09:31:55 PDT 2009

This is the first grammar I've written from scratch, and I'm having some 
difficulty picking up quoted strings and non-quoted keywords.  Here's 
a sample input:

    Google[+Keyword1 Keyword2 -Keyword3] Yahoo["Phrase 1" +"Phrase 2" -"Phrase 3"] Keyword4 "Phrase 4"

Hypothetically this could drive a multi-search-engine search (don't get 
caught up on that, though; it's not at all what I'm using it for :).

I'm having two issues with it: one is functional, the other is more 
about grammar file maintenance.  (3) is just a bonus question.  I'm 
hoping these are newbie questions, because if they're not I'm probably 
in over my head and need to do some more reading:

1) When antlr gives me the quoted string, I lose the whitespace 
associated with it, which is significant for me only in a quoted 
string.  "foo     bar" becomes <">, <foo>, <bar> and <">.  So if I'm 
searching, say, a database, and the amount of whitespace is significant 
in a column (not that this isn't a silly idea), then I'm out of luck.
2) In order to also catch non-quoted expressions, I have to list every 
terminal token in a catch-all rule, wrapped in the negation operator ~.  
So if I later add a token to the grammar, I have to remember to add it 
to this catch-all rule as well.
3) Can I safely add the $channel=HIDDEN option to the DOUBLE_QUOTE token 
so I don't see those at all?
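
To make (1) concrete, here is a rough sketch of the token stream I'm 
hoping to end up with.  It is plain Python, not ANTLR, and it says 
nothing about how ANTLR should get there; the `tokenize` function and 
the token names QUOTED, NQUOTED and WS are made up for illustration 
(the other names mirror the tokens section of the grammar).

```python
import re

# Hypothetical token names, for illustration only.
TOKEN_RE = re.compile(r'''
    (?P<QUOTED>"[^"]*")           # a whole quoted phrase, inner whitespace kept
  | (?P<INCLUSION>\+)
  | (?P<EXCLUSION>-)
  | (?P<LEFT_SQB>\[)
  | (?P<RIGHT_SQB>\])
  | (?P<WS>\s+)                   # whitespace between tokens, dropped below
  | (?P<NQUOTED>[^\s"\[\]+-]+)    # anything else: an unquoted keyword
''', re.VERBOSE)


def tokenize(text):
    """Return a list of (token_name, value) pairs for the sample input format."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # Reached e.g. by a quote that is never closed.
            raise ValueError("cannot tokenize at offset %d" % pos)
        name = match.lastgroup
        if name == "QUOTED":
            # Drop the surrounding quotes, keep everything between them as-is.
            tokens.append((name, match.group()[1:-1]))
        elif name != "WS":
            tokens.append((name, match.group()))
        pos = match.end()
    return tokens


print(tokenize('"foo     bar" baz'))
# [('QUOTED', 'foo     bar'), ('NQUOTED', 'baz')]
```

The point is only that the quoted phrase arrives as one value with its 
five spaces intact, while whitespace outside quotes still disappears.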

Here's my grammar, followed by a few more comments:

  grammar search;

  /* Google[+Keyword1 Keyword2 -Keyword3] Yahoo["Phrase 1" +"Phrase 2" 
-"Phrase 3"] Keyword4 "Phrase 4" */

  tokens {
    INCLUSION='+';
    EXCLUSION='-';
    DOUBLE_QUOTE='"';
    LEFT_SQB='[';
    RIGHT_SQB=']';
  }

  advanced_search : ( attribute_search | keyword_search )+ ;

  attribute_label : quoted_value | NQUOTED_VALUE ;

  WS : ( ' ' | '\r' | '\t' | '\u000C' | '\n' )+ { $channel=HIDDEN; } ;
/* I took the above from 
http://www.antlr.org/wiki/display/ANTLR3/Grammars in the "Lexer Rules" 
subsection. */

  attribute_search : attribute_label LEFT_SQB attribute_value_spec+ 
RIGHT_SQB ;

  keyword_search : quoted_value | NQUOTED_VALUE ;

  attribute_value_spec : ( inclusion_exclusion? attribute_value ) ;

  inclusion_exclusion : ( INCLUSION | EXCLUSION ) ;

  attribute_value : quoted_value | NQUOTED_VALUE ;

  quoted_value : DOUBLE_QUOTE ( options {greedy=false;} : . )* 
DOUBLE_QUOTE ;
/* The above is based on the comment lexer rule section from the 
examples in the "EBNFs" section in 
http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required 
*/

  /* + rather than *, so that this rule cannot match the empty string */
  NQUOTED_VALUE :    ~( INCLUSION | EXCLUSION | DOUBLE_QUOTE | LEFT_SQB 
| RIGHT_SQB | ' ' | '\r' | '\t' | '\u000C' | '\n' )+ ;

I've tried a number of different things for (2), such as putting .* for 
NQUOTED_VALUE, but I get the unmatchable alternative errors (which I 
still haven't wrapped my head around).  I'm also a little scared to 
touch the quoted_value rule definition to try and fix (1) because it's 
working pretty well now if I ignore the issue :).
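
To illustrate what I mean by the maintenance problem in (2), here is 
the property I'd like, again sketched in Python rather than ANTLR: the 
"everything else" pattern is derived from the one delimiter table, so 
adding a token means editing a single place.  The names `DELIMITERS`, 
`build_nquoted_regex` and `COLON` are hypothetical, and I don't know 
whether ANTLR offers an equivalent.

```python
import re

# Single-character delimiters, mirroring the tokens { ... } section above.
DELIMITERS = {
    "INCLUSION": "+",
    "EXCLUSION": "-",
    "DOUBLE_QUOTE": '"',
    "LEFT_SQB": "[",
    "RIGHT_SQB": "]",
}


def build_nquoted_regex(delimiters):
    """Match one or more characters that are neither a delimiter nor whitespace."""
    excluded = "".join(re.escape(ch) for ch in delimiters.values())
    return re.compile(r"[^\s" + excluded + r"]+")


nquoted = build_nquoted_regex(DELIMITERS)
print(nquoted.findall("Google[+Keyword1 Keyword2]"))
# ['Google', 'Keyword1', 'Keyword2']

# Adding a (hypothetical) COLON delimiter only touches the table.
with_colon = build_nquoted_regex(dict(DELIMITERS, COLON=":"))
print(with_colon.findall("site:example"))
# ['site', 'example']
```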

Thanks,
  Scott