[antlr-interest] Parsing quoted phrases and non-quoted keywords

Fri Jul 31 13:21:14 PDT 2009

At 04:31 1/08/2009, Scott Van Wart wrote:
 >1) When antlr gives me the quoted string, I lose the whitespace
 >associated with it, which is significant for me only in a quoted
 >string.  "foo     bar" becomes <">, <foo>, <bar> and <">.  So if
 >I'm searching, say, a database, and the amount of whitespace is
 >significant in a column (not that this isn't a silly idea), then
 >I'm out of luck.
[...]
 >    DOUBLE_QUOTE='"';

Remove this.

 >  quoted_value : DOUBLE_QUOTE ( options {greedy=false;} : . )*
 >DOUBLE_QUOTE ;

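Make this a lexer rule (QUOTED_VALUE) rather than a parser rule, 
so that the whole string, whitespace included, comes through as a 
single token.  See the example string rule in the wiki.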
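
As a rough sketch of what I mean (ANTLR 3 syntax; this is not the 
wiki's exact rule, and it assumes you don't need escaped quotes 
inside the string):

   QUOTED_VALUE
       : '"' ( ~'"' )* '"'   // anything up to the closing quote, spaces included
       ;

The token text will still include the surrounding quote characters, 
so you would need to strip those yourself afterwards.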

 >  NQUOTED_VALUE :    ~( INCLUSION | EXCLUSION | DOUBLE_QUOTE |
 >LEFT_SQB
 >| RIGHT_SQB | ' ' | '\r' | '\t' | '\u000C' | '\n' )* ;

You must use + here at the very least, not *.  (It's very very bad 
to create a lexer rule that can successfully match zero characters: 
the lexer can then produce empty tokens without ever advancing 
through the input.)
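
With that change, and with the quote written as a literal now that 
DOUBLE_QUOTE is gone, the rule might look something like this 
(a sketch only; it assumes your other tokens stay defined exactly 
as they are in your grammar):

   NQUOTED_VALUE
       : ~( INCLUSION | EXCLUSION | '"' | LEFT_SQB | RIGHT_SQB
          | ' ' | '\r' | '\t' | '\u000C' | '\n' )+   // one or more, never zero
       ;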

Another alternative here is to just use this instead:

   OTHER: . ;

You can't use a loop, though (without doing something similar to 
what you already had), otherwise it will consume things that you 
want as other tokens as well.  The downside of this is that it 
will generate a token for each character rather than grouping 
them.

You could mitigate this by defining more tokens for specific types 
of things you're expecting (operators, sequences of alphanumeric 
characters, etc).
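
For example, something like the following (a sketch in ANTLR 3 
syntax with the Java target; WORD and WS are names I've made up for 
illustration, and you'd adjust the character sets to suit):

   WORD  : ( 'a'..'z' | 'A'..'Z' | '0'..'9' )+ ;   // runs of alphanumerics
   WS    : ( ' ' | '\t' | '\r' | '\n' | '\u000C' )+
           { $channel = HIDDEN; } ;                // hide whitespace from the parser
   OTHER : . ;                                     // any other single character

Keep OTHER as the last lexer rule in the grammar, so that the more 
specific rules are preferred whenever they can also match.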