[antlr-interest] Easy method of preserving white space in string literals

Mon Oct 22 17:13:34 PDT 2012

NOTE: This post covers the Java target only.  For other targets, you're on your own.
Just wanted to share something that LOOKS like it's working for me:

I have the typical need to ignore whitespace between all tokens EXCEPT string literals.

So 
    A    +B +    C
Is a perfectly acceptable construct that parses to tokens A, +, B, +, C

But
    "this is a big old string"
Needs to be one token (a string literal) with the spaces preserved.

Here's what I did:

Add the following to the lexer members section:
@lexer::members { 
  boolean in_stringliteral = false; 
}

Modify STRING_LITERAL token definition to set and un-set that value, like so:

STRING_LITERAL    
    : '"' { in_stringliteral=true; } //Set a variable indicating the lexer has begun consuming a string literal 
        ( options{greedy=false;} 
          : ESC_SEQ 
          | STR_CHAR 
        )* 
      '"' { in_stringliteral=false; } //Set a variable indicating the lexer has finished consuming a string literal
    ;

(constructs for ESC_SEQ and STR_CHAR elided for brevity)

Finally, only set the channel to HIDDEN when not consuming a string literal.

WHITESPACE
    : (' ' | '\t') { if (!in_stringliteral) $channel=HIDDEN; } //Only hide whitespace outside of a string literal
    ;
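
For anyone who wants to see the flag idea outside of ANTLR, here is a minimal hand-written Java sketch of the same logic. It is not what ANTLR generates and it is not the grammar above. The class name FlagLexerSketch, the tokenize method, and the simplified token rules (identifiers made of letters, digits and underscores, single-character operators, backslash escapes inside strings) are all assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is NOT ANTLR-generated code.
// All names and token rules here are hypothetical.
final class FlagLexerSketch {

    // Splits the input into tokens. Spaces and tabs are dropped unless
    // they occur inside a double-quoted string literal.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inStringLiteral = false; // plays the role of in_stringliteral

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inStringLiteral) {
                current.append(c); // whitespace is kept while the flag is set
                if (c == '\\' && i + 1 < input.length()) {
                    current.append(input.charAt(++i)); // keep the escaped character
                } else if (c == '"') {
                    inStringLiteral = false; // closing quote ends the literal
                    flush(tokens, current);
                }
            } else if (c == '"') {
                flush(tokens, current);
                current.append(c);
                inStringLiteral = true; // opening quote starts the literal
            } else if (c == ' ' || c == '\t') {
                flush(tokens, current); // outside a literal, whitespace only separates tokens
            } else if (Character.isLetterOrDigit(c) || c == '_') {
                current.append(c); // part of an identifier or number
            } else {
                flush(tokens, current);
                tokens.add(String.valueOf(c)); // assumed: operators are single characters
            }
        }
        flush(tokens, current); // an unterminated literal is emitted as-is
        return tokens;
    }

    // Emits the pending token text, if any, and resets the buffer.
    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A    +B +    C"));
        System.out.println(tokenize("\"this is a big old string\""));
    }
}
```

The two calls in main use the examples from the top of this post: the first input should split into the five tokens A, +, B, +, C, and the second should come back as a single token with its spaces intact.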

I tried doing this with gated predicates, but it just didn't work for me.  The downside of the flag approach is that the interpreter in Eclipse fails when you feed it string literals with spaces in them.

YMMV, but this solved a major headache for me, and I never found a viable solution online.