[antlr-interest] String lexing and partial tokens

Sat Nov 25 03:56:38 PST 2006

What's the new 3.0 way to do string lexing?  I'd like to have it 
strip off the surrounding quotes so that the token contains just 
the text itself.  My first attempt was this, since it's the v2 
way:

STRING: '"'! ( ~'"' )* '"'!	;

But that gives me this error:

error(149): Message.g3:101:7: rule STRING uses rewrite syntax or 
operator with no output option or lexer rule uses !

Looking in the archives seems to indicate that ! is no longer 
supported, which is a pain in the butt.  It was a nice simple 
syntax, and the alternatives all seem a lot more 
complicated.  Incidentally, what *is* the recommended 
alternative?  Further posts seemed to suggest that calling 
$setText or setText would do the trick, but those functions don't 
seem to exist in the C runtime (which is what I'm trying to use); 
or at least I can't find them.

For the moment I've ended up with the following, which seems to 
work but just seems a bit too evil to me...

STRING: '"' content=UnquotedText '"'	{ emit($content); 
ltoken()->type = STRING; };

fragment UnquotedText:	(~'"')*	;

(the fragment seemed a little silly, but it wouldn't accept the 
label otherwise.)

On an only-slightly-related note, I was also wondering what's the 
right way to deal with lexical ambiguity?  Say I've got one 
parsing context (eg. after a #include in C) where backslashes are 
treated literally, not as escapes, and another context (anywhere 
else) where they should be used as an escape sequence.  And again, 
ideally I want the resulting token to contain the 'real' string 
(ie. after escapes had been acted on).  Is this even possible?  (I 
imagine you could do it by treating it as an island grammar.  But 
that seems a little heavyweight.)