[antlr-interest] Manipulating lexer text output
Gavin Lambert
antlr at mirality.co.nz
Sat Mar 31 16:26:16 PDT 2007
Ok, next question :)
Is there some way for a lexer rule to manipulate the output text
of the lexer token, when it's not the rule responsible for
generating that token? (I'm using the C language target, if that
makes a difference.)
For example, imagine this grammar fragment:
    fragment
    EscapeSequence
        : '\\'
          ( '\\'
          | 'n'
          | ('\r' | '\n') WS?
          )
        ;

    STRING
        : '"' (~('"' | '\\') | EscapeSequence)* '"'
        ;
This works as is, but the token text is identical to the source
text, including all quotes and escape sequences. What I'd like
instead is the semantic equivalent: a STRING token whose quotes
are removed and whose escape sequences have been resolved, i.e.
\\ becomes a single backslash, \n becomes a real newline
character, and the text of the final alternative is removed
entirely (that's a line-folding escape). That way the string only
has to be parsed once, instead of having to reparse the token
text outside of ANTLR.
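To make the desired transformation concrete, here is a minimal sketch of it as a standalone C function. It does not use any ANTLR runtime API; the name unescape_string is hypothetical, and since the grammar fragment above does not show the WS rule, the sketch assumes WS means spaces, tabs, and line breaks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: WS covers space, tab, CR and LF (the WS rule is not shown). */
static int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper, not part of the ANTLR runtime.
 * Converts the raw text of a STRING token (quotes included) into its
 * semantic value and writes a NUL-terminated result to dst.
 * dst needs room for strlen(src) + 1 bytes, because the result is never
 * longer than the input. Returns the length of the result. */
size_t unescape_string(const char *src, char *dst)
{
    size_t len, start, end, i, out = 0;

    assert(src != NULL && dst != NULL);
    len = strlen(src);

    /* Drop the surrounding quotes if both are present. */
    start = 0;
    end = len;
    if (len >= 2 && src[0] == '"' && src[len - 1] == '"') {
        start = 1;
        end = len - 1;
    }

    i = start;
    while (i < end) {
        char c = src[i++];
        char e;

        /* Ordinary character, or a lone trailing backslash: copy it. */
        if (c != '\\' || i >= end) {
            dst[out++] = c;
            continue;
        }

        e = src[i++];
        switch (e) {
        case '\\':              /* \\ -> single backslash */
            dst[out++] = '\\';
            break;
        case 'n':               /* \n -> real newline */
            dst[out++] = '\n';
            break;
        case '\r':
        case '\n':              /* line folding: drop the break and any WS */
            while (i < end && is_ws(src[i]))
                i++;
            break;
        default:                /* not in the grammar: keep it unchanged */
            dst[out++] = '\\';
            dst[out++] = e;
            break;
        }
    }

    dst[out] = '\0';
    return out;
}
```

Given the raw token text "a\\b" (with its quotes), this yields a, a single backslash, then b. A backslash followed by a line break and indentation disappears entirely. Ideally the lexer would produce this text itself rather than needing a second pass like this one.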
Rewrite rules sound like the sort of thing that would help here,
but they don't seem to work in the lexer. I also tried calling
emitNew from the subrule (EscapeSequence), but that replaced the
text of the entire string, not just the substring matched by the
subrule. Any ideas?