[antlr-interest] Easier way to do string literals?

Mon Oct 15 01:37:28 PDT 2007

Gavin Lambert wrote:
> At 20:18 15/10/2007, Vaclav Barta wrote:
>  >quotedString returns [ String value ]
>  >@init {
>  >    StringBuffer sb;
>  >} : {
>  >    sb = new StringBuffer();
>  >}
>  >    DQUOTE (
>  >        EscapeSequence { sb.append($EscapeSequence.getText()); }
>  >        | BareString { sb.append($BareString.getText()); }
>  >    )* DQUOTE { $value = sb.toString(); }
>  >    ;
> 
> That sort of thing is fine if all you're parsing is string constants, 
> but in a larger language it loses (apart from anything else, you've 
> probably got an auto-whitespace-stripper, whereas whitespace needs to be 
Sorry, I oversimplified; the original rule is:

quotedString returns [ String value ]
@init {
	StringBuffer sb;
} : {
	sb = new StringBuffer();
}
	DQUOTE (
		EscapeSequence { sb.append($EscapeSequence.getText()); }
		| BareString { sb.append($BareString.getText()); }
		| COLON { sb.append(':'); }
		| EQ  { sb.append('='); }
		| SP { sb.append($SP.getText()); }
		| TAB  { sb.append('\t'); }
		| StringChar { sb.append($StringChar.getText()); }
		| v = varUse { sb.append($v.value); }
	)* DQUOTE { $value = sb.toString(); }
	;

and the whole grammar (I've put it at 
http://mangrove.cz/antmaker/Loader.g; it's just an experiment with 
Makefile-like syntax, converting build instructions to Ant XML) is 
indeed a bit atypical in that it handles whitespace explicitly...

> preserved within strings).  And you're quite likely going to get random 
> Identifier and Number etc tokens in there, not just EscapeSequences and 
> BareStrings.  And unmatched comments, too -- block and line comment 
...and doesn't distinguish quoted from unquoted strings. Identifiers and 
numbers are just strings, and if it had comments, they would be line 
comments whose marker would need its own branch inside quotedString. So 
the example probably isn't as widely applicable as I implied :-) but I'd 
still like to use ANTLR to parse string literals (at least those 
complicated enough to need parsing)...

> Now what you *could* do is to treat it like the island grammar example 
> and have a separate ANTLR grammar for parsing the internals of strings, 
> but that seems excessive to me for what amounts to a simple string 
> replace operation.
Is there really no way to parse C-like string literals in one pass?
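
For comparison, here is a minimal sketch of what I understand the "simple
string replace" alternative to look like. It assumes the lexer hands over
the whole quoted literal, quotes included, as the text of a single token.
The class name StringLiteralUnescaper and the set of supported escapes are
made up for illustration and are not taken from Loader.g:

```java
// Hypothetical helper, not part of Loader.g: post-processes the text of a
// single string-literal token after lexing.
final class StringLiteralUnescaper {

    private StringLiteralUnescaper() {
    }

    /** Strips the surrounding double quotes and resolves simple backslash escapes. */
    static String unescape(String literal) {
        int end = literal.length() - 1;
        if (end < 1 || literal.charAt(0) != '"' || literal.charAt(end) != '"') {
            throw new IllegalArgumentException("not a quoted literal: " + literal);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = literal.charAt(i);
            if (c != '\\') {
                sb.append(c);
                continue;
            }
            // A backslash must be followed by an escape character
            // before the closing quote.
            if (++i >= end) {
                throw new IllegalArgumentException("dangling backslash in: " + literal);
            }
            char e = literal.charAt(i);
            switch (e) {
                case 'n':  sb.append('\n'); break;
                case 't':  sb.append('\t'); break;
                case 'r':  sb.append('\r'); break;
                case '"':  sb.append('"');  break;
                case '\\': sb.append('\\'); break;
                default:
                    throw new IllegalArgumentException("unknown escape: \\" + e);
            }
        }
        return sb.toString();
    }
}
```

This only covers plain escapes. Something like the varUse branch in
quotedString has no counterpart here, which is part of why I'd rather keep
the work inside the grammar.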

	Bye
		Vasek