[antlr-interest] Allowing and maintaining space characters in string literals

John B. Brodie jbb at acm.org
Thu Sep 8 11:58:00 PDT 2011


Greetings!

Have you looked at the Java grammar in the ANTLR v3 example suite?
Also, some comments inline below.

On Thu, 2011-09-08 at 18:08 +0000, Janet.Hurwitz at usc-bt.com wrote:
> Hello- I'm working on a grammar that needs to support embedded blanks in strings: "identifier=two words"
> The interpreter keeps breaking at 'two' and doesn't know what to do with 'words'.

don't use the interpreter. it has some quirks.

> I was initially ignoring white space (because 'id1 = oneword, id2 =" two words"' must also be supported with spaces around the = and ,), but obviously, can't do that.
> I have tried what was suggested in an archived post:
> 
> STRING_LITERAL : (STRCHAR)+ ( ((' ')+ STRCHAR)=> (' ')+ (STRCHAR)+ )*

are you lexing the leading/trailing quote marks separately from the
characters comprising the string literal?

if so, don't do that.

> But that didn't work either! (no viable alternative at input 'words'). It's not including 'words' as part of the string.
> 
> In my grammar:
> fragment LETTER :('a'..'z' | 'A'..'Z');
> fragment DIGIT : '0'..'9';
> fragment OTHERCHARS : ('.' | '/' | '-' | '&');
> STRCHAR : (LETTER | DIGIT | OTHERCHARS)+;
> 
> I have tried various combinations of handling the blank in the lexing v. the parsing, including trying to create a quoted-string rule.
> I will have to support the following:

you want the string literal to be processed completely by the lexer,
from the opening quote up to and including the closing quote. that way
no other tokens will interfere with handling the characters between the
quote marks.
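
To make the difference concrete, here is a rough sketch in Python rather than ANTLR. The token names and regexes are my own stand-ins (WORD only approximates your STRCHAR rule), so treat it as an illustration of the idea, not as what ANTLR generates. It tokenizes the same input twice: once with the quote mark as its own token, and once with the whole quoted string as a single token.

```python
import re

# Hypothetical stand-ins, not ANTLR: WORD approximates the STRCHAR rule
# (letters, digits, '.', '/', '-', '&'). Whitespace matches no named group
# and is therefore dropped.
SPLIT_QUOTES = re.compile(
    r'\s+|(?P<QUOTE>")|(?P<WORD>[A-Za-z0-9./&-]+)|(?P<PUNCT>[=,])')
WHOLE_STRING = re.compile(
    r'\s+|(?P<STRING>"[^"]*")|(?P<WORD>[A-Za-z0-9./&-]+)|(?P<PUNCT>[=,])')


def tokenize(pattern, text):
    """Return (token name, matched text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup is not None:  # None means the whitespace branch matched
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


text = 'identifier ="two words"'
print(tokenize(SPLIT_QUOTES, text))
print(tokenize(WHOLE_STRING, text))
```

With the quote as a separate token, 'two' and 'words' come out as two WORD tokens and the blank between them is already gone by the time the parser sees anything. With the whole literal as one token, the blank survives inside the token text.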

> 
> "identifier =two words"
> identifier ="two words"
> 
> The identifier=value pairs appear in a comma-separated line. There are various nested structures of identifier=value pairs involved, which is why both of the above formats are supported.
> 
> *** Bottom line*** I just want to indicate: If a space appears between quotation marks, include it as part of the current token; if not, throw it away.
> 
> I have everything working in a complex structure and tree walker except for the embedded blanks allowed in strings! Any suggestions are appreciated.

these lexer rules work for me:

STRING : '"' (options{greedy=false;}:( ~('\\'|'"') | ('\\' '"')))* '"'; 

WS : ( ' ' | '\t' | '\f' | '\r' | '\n' )+ { $channel=HIDDEN; } ;
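
If you want to check quickly which inputs the STRING rule accepts, the following Python regex is my approximation of it (an assumption on my part, not something ANTLR produces). As I read the rule, the only escape it allows is a backslash followed by a double quote; a backslash followed by anything else is rejected.

```python
import re

# Regex approximation (an assumption, not ANTLR output) of:
#   '"' ( ~('\\'|'"') | '\\' '"' )* '"'
# i.e. any character except backslash and quote, or the pair backslash-quote.
STRING = re.compile(r'"(?:[^\\"]|\\")*"')


def is_string_literal(text):
    """Return True if the whole text is exactly one STRING literal."""
    return STRING.fullmatch(text) is not None


print(is_string_literal('"two words"'))        # True
print(is_string_literal(r'"say \"hi\" now"'))  # True
print(is_string_literal(r'"a\tb"'))            # False: backslash not followed by a quote
print(is_string_literal('"unterminated'))      # False: no closing quote
```

If you need other escapes such as a doubled backslash, the rule would have to grow an alternative for them.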

Hope this helps...
   -jbb



