[antlr-interest] String lexing and partial tokens
Gavin Lambert
antlr at mirality.co.nz
Sat Nov 25 03:56:38 PST 2006
What's the new 3.0 way to do string lexing? I'd like to have it
strip off the surrounding quotes so that the token contains just
the text itself. My first attempt was this, since it's the v2
way:
STRING: '"'! ( ~'"' )* '"'! ;
But that gives me this error:
error(149): Message.g3:101:7: rule STRING uses rewrite syntax or
operator with no output option or lexer rule uses !
Searching the archives suggests that ! is no longer supported in
lexer rules, which is a pain in the butt. It was a nice simple
syntax, and the alternatives all seem a lot more
complicated. Incidentally, what *is* the recommended
alternative? Later posts suggested that calling $setText or
setText would do the trick, but those functions don't seem to
exist in the C runtime (which is what I'm trying to use); or at
least I can't find them.
For the moment I've ended up with the following, which seems to
work but feels a bit too evil to me:
STRING: '"' content=UnquotedText '"'
        { emit($content); ltoken()->type = STRING; };
fragment UnquotedText: (~'"')* ;
(The fragment rule seems a little silly, but ANTLR wouldn't
accept the label otherwise.)
On an only slightly related note, I was also wondering what the
right way to deal with lexical ambiguity is. Say I've got one
parsing context (e.g. after a #include in C) where backslashes
are treated literally, not as escapes, and another context
(anywhere else) where they introduce an escape sequence. And
again, ideally I want the resulting token to contain the 'real'
string (i.e. after the escapes have been processed). Is this even
possible? (I imagine you could do it by treating it as an island
grammar, but that seems a little heavyweight.)
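
To make the desired behaviour concrete, here is a minimal standalone C sketch of the text transformation being asked for. It is not ANTLR runtime code and does not answer how to hook this into the lexer: the function name unquote, its decode_escapes flag, and the small set of escapes it understands are all assumptions made purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of the ANTLR C runtime.
 *
 * Copies the text between the surrounding double quotes of `raw`
 * into `out`.  If `decode_escapes` is nonzero, a backslash is treated
 * as an escape introducer: \n and \t become newline and tab, and any
 * other escaped character (including \\ and \") is copied as itself.
 * If `decode_escapes` is zero, backslashes are copied literally, as
 * in the #include context.
 *
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if `raw` is not quoted, ends in a dangling backslash while
 * decoding, or does not fit in `out`. */
int unquote(const char *raw, int decode_escapes, char *out, size_t outsz)
{
    size_t len = strlen(raw);
    size_t n = 0;

    if (outsz == 0 || len < 2 || raw[0] != '"' || raw[len - 1] != '"')
        return -1;

    /* Walk only the characters strictly between the two quotes. */
    for (size_t i = 1; i + 1 < len; i++) {
        char c = raw[i];

        if (decode_escapes && c == '\\') {
            if (i + 2 >= len)       /* backslash with nothing after it */
                return -1;
            c = raw[++i];
            switch (c) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            default:  break;        /* \\, \" and others: keep the char */
            }
        }

        if (n + 1 >= outsz)         /* always leave room for the NUL */
            return -1;
        out[n++] = c;
    }

    out[n] = '\0';
    return (int)n;
}
```

With the flag set to 0, the token "dir\name.h" comes out as dir\name.h with the backslash intact; with the flag set to 1, the same input has its \n turned into a newline. The open question in this post is where such a flag would come from, since the lexer would need to know which context it is in when it matches the string.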