[antlr-interest] Easy method of preserving white space in string literals

John B. Brodie jbb at acm.org
Mon Oct 22 18:25:36 PDT 2012


Greetings!

No need for a flag (aka lexer state), I believe.

Once ANTLRv3 sees the opening " of the string and commits itself to 
recognizing the literal, no other token, including your WHITESPACE 
token, will be considered for recognition.

Your definitions for ESC_SEQ and/or STR_CHAR (hopefully these are 
fragments) need to include *all* possible characters comprising a 
literal in your language, including blanks and tabs.

If, on the other hand, either ESC_SEQ or STR_CHAR refers to WHITESPACE 
in order to recognize the blanks and tabs, then don't do that ;-) Make a 
fragment that just recognizes the blanks and tabs (and whatever else you 
need as whitespace) and then refer to that fragment in both the 
WHITESPACE and STR_CHAR rules. It should actually work as a 
non-fragment, but you will suffer the overhead of creating a WHITESPACE 
token that is simply thrown away when its $text is incorporated into 
the STRING_LITERAL token.
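
Something along these lines is what I have in mind. It is an untested 
sketch: the grammar name and WS_CHAR are made up for illustration, and 
the bodies of ESC_SEQ and STR_CHAR are my guesses, since yours were not 
shown.

```
// Untested sketch for ANTLR v3. The grammar name, WS_CHAR, and the
// bodies of ESC_SEQ and STR_CHAR are assumptions, not the poster's rules.
lexer grammar StringWsSketch;

// Shared fragment: never produces a token of its own.
fragment WS_CHAR
    :   ' ' | '\t'
    ;

// Between tokens, runs of blanks and tabs go to the hidden channel.
WHITESPACE
    :   WS_CHAR+ { $channel = HIDDEN; }
    ;

// Inside the quotes only ESC_SEQ and STR_CHAR are consulted, so the
// blanks and tabs become part of the STRING_LITERAL text.
STRING_LITERAL
    :   '"' ( ESC_SEQ | STR_CHAR )* '"'
    ;

// Assumed escape set; adjust to your language.
fragment ESC_SEQ
    :   '\\' ( '"' | '\\' | 'n' | 't' )
    ;

// Whitespace via the shared fragment, plus anything else except the
// quote and the backslash.
fragment STR_CHAR
    :   WS_CHAR
    |   ~( '"' | '\\' | ' ' | '\t' )
    ;
```

Since neither alternative in the loop can match an unescaped ", the 
greedy=false option should not be needed with these definitions.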

What happened that caused you to need to add the flag (lexer state)? Did 
you try it from the command line, outside of Eclipse? I have never used 
the Eclipse plug-in so am unsure of its capabilities.

Hope this helps....
    -jbb


On 10/22/2012 08:13 PM, Michael Cooper wrote:
> NOTE: This post covers the Java target.  Other targets- you're on your own
> Just wanted to share something that LOOKS like it's working for me:
>
> I have the typical need to ignore whitespace between all tokens EXCEPT string literals.
>
> So
>      A    +B +    C
> Is a perfectly acceptable construct that parses to tokens A, +, B, +, C
>
> But
>      "this is a big old string"
> Needs to be one token (a string literal) with the spaces preserved.
>
> Here's what I did:
>
> Add the following to the lexer members section:
> @lexer::members {
>    boolean in_stringliteral = false;
> }
>
>
> Modify STRING_LITERAL token definition to set and un-set that value, like so:
>
> STRING_LITERAL
>      : '"' { in_stringliteral=true; } //Set a variable indicating the lexer has begun consuming a string literal
>          ( options{greedy=false;}
>            : ESC_SEQ
>            | STR_CHAR
>          )*
>        '"' { in_stringliteral=false; } //Set a variable indicating the lexer has finished consuming a string literal
>      ;
>
> (constructs for ESC_SEQ and STR_CHAR elided for brevity)
>
> Finally, only set the channel to HIDDEN when not consuming a string literal.
>
>
> WHITESPACE
>      :    (' ' | '\t') { if(in_stringliteral==false) $channel=HIDDEN; }
> ;
>
> I tried doing this with gated predicates, but it just didn't work for me.  The downside is that the interpreter in Eclipse fails when you feed it string literals with spaces in them.
>
> YMMV, but this solved a major headache for me, and I never found a viable solution online.