[antlr-interest] Easy method of preserving white space in string literals

Jim Idle jimi at temporal-wave.com
Mon Oct 22 19:31:58 PDT 2012


I am not sure why you could not find an online example of this because as
far as I can tell this is just normal stuff:

STRING : '"' ~'"'* '"' ;
WS: (' '|'\t')+ { skip(); } ;
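
To see why these two rules are enough, here is a rough hand-written Java sketch of what a lexer built from them does. The class name TinyLexer and the extra identifier/operator handling are my own assumptions for illustration, not anything ANTLR generates. The point is only that the string rule consumes everything up to the closing quote in one go, so the whitespace rule never gets a chance to run inside a literal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated lexer; illustration only.
final class TinyLexer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t') {
                // WS: skipped, but only when we are between tokens.
                i++;
            } else if (c == '"') {
                // STRING: '"' ~'"'* '"' -- take everything up to the next quote.
                int end = input.indexOf('"', i + 1);
                if (end < 0) {
                    end = input.length() - 1; // unterminated: take the rest
                }
                tokens.add(input.substring(i, end + 1));
                i = end + 1;
            } else if (Character.isLetterOrDigit(c)) {
                // Assumed identifier rule: a run of letters or digits.
                int start = i;
                while (i < input.length()
                        && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                // Assumed operator rule: any other single character.
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }
}
```

Calling TinyLexer.tokenize on the input A    +B +    C gives the five tokens A, +, B, +, C, while a quoted string comes back as a single token with its spaces intact.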

I recommend that you don't deal with escape sequences or anything else in
your lexer rule (other than \" if you need that), but just get the string,
then analyze and convert each string later. You will get much better
error messages:

Illegal escape code in literal at line 7, offset 34.

That is better than a terse lexer error. You should try to get your lexer
to accept anything without raising a lexer error (not always easy), and
either give an indication to the parser that something is wrong or record
a custom error from actions.
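
As a rough sketch of the "convert later" step in Java: the class StringLiteralDecoder, its errors list, and the set of supported escapes (\n, \t, \\, \") are all assumptions made for illustration. How you obtain the token text, line and offset depends on your own lexer and target.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that decodes the raw text of a STRING token after lexing.
final class StringLiteralDecoder {
    // Errors are collected here rather than raised by the lexer.
    final List<String> errors = new ArrayList<>();

    /**
     * raw    - token text, including the surrounding quotes
     * line   - line on which the token starts
     * column - offset of the opening quote within that line
     */
    String decode(String raw, int line, int column) {
        StringBuilder out = new StringBuilder();
        int last = raw.length() - 1; // index of the closing quote
        for (int i = 1; i < last; i++) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            int backslashOffset = column + i;
            if (i + 1 >= last) {
                // Backslash with nothing after it inside the literal.
                errors.add(message(line, backslashOffset));
                break;
            }
            char e = raw.charAt(++i);
            switch (e) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '\\': out.append('\\'); break;
                case '"':  out.append('"');  break;
                default:
                    // Record the problem and keep going with the raw character.
                    errors.add(message(line, backslashOffset));
                    out.append(e);
            }
        }
        return out.toString();
    }

    private static String message(int line, int offset) {
        return "Illegal escape code in literal at line " + line
                + ", offset " + offset + ".";
    }
}
```

Because the decoder knows where the token started, it can point at the exact backslash that caused the problem, and it can carry on and report every bad escape in the literal instead of stopping at the first one.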

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Michael Cooper
Sent: Tuesday, October 23, 2012 8:14 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Easy method of preserving white space in string
literals

NOTE: This post covers the Java target. For other targets, you're on your
own.

Just wanted to share something that LOOKS like it's working for me:

I have the typical need to ignore whitespace between all tokens EXCEPT
string literals.

So
    A    +B +    C
Is a perfectly acceptable construct that parses to tokens A, +, B, +, C

But
    "this is a big old string"
Needs to be one token (a string literal) with the spaces preserved.

Here's what I did:

Add the following to the lexer members section:
@lexer::members {
  boolean in_stringliteral = false;
}


Modify STRING_LITERAL token definition to set and un-set that value, like
so:

STRING_LITERAL
    : // Set a variable indicating the lexer has begun consuming a string
      // literal
      '"' { in_stringliteral=true; }
        ( options{greedy=false;}
          : ESC_SEQ
          | STR_CHAR
        )*
      // Set a variable indicating the lexer has finished consuming a
      // string literal
      '"' { in_stringliteral=false; }
    ;

(constructs for ESC_SEQ and STR_CHAR elided for brevity)

Finally, only set the channel to HIDDEN when not consuming a string
literal.


WHITESPACE
    :    (' ' | '\t') { if(in_stringliteral==false) $channel=HIDDEN; } ;

I tried doing this with gated predicates, but it just didn't work for me.
The downside is that the interpreter in Eclipse fails when you feed it
string literals with spaces in them.

YMMV, but this solved a major headache for me, and I never found a viable
solution online.
