[antlr-interest] Manipulating lexer text output

Tue Apr 3 11:54:09 PDT 2007

Yeah, this has come up a lot, and really the solution at the moment is
to do this in the parser. However, you could at least do it in the
STRING rule: call a small function that strips the quotes and resolves
the escape sequences, then emit the token with the result as its text.
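
A minimal sketch of the kind of small function meant here, in plain C.
It is only an illustration: the name unescape_string is hypothetical and
not part of the ANTLR runtime, and it assumes that WS in the grammar
below means spaces, tabs, and line breaks.

```c
#include <stddef.h>

/* Assumption: the grammar's WS rule covers these four characters. */
static int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper, not an ANTLR API.
 * src/len: the raw STRING token text, including both quotes.
 * dst:     output buffer of at least len + 1 bytes.
 * Returns the length of the unescaped text (excluding the NUL). */
size_t unescape_string(const char *src, size_t len, char *dst)
{
    size_t i = 1;                            /* skip the opening quote   */
    size_t end = (len >= 2) ? len - 1 : 0;   /* stop at the closing quote */
    size_t out = 0;

    while (i < end) {
        char c = src[i++];
        if (c != '\\' || i >= end) {         /* ordinary character */
            dst[out++] = c;
            continue;
        }
        c = src[i++];                        /* character after the backslash */
        switch (c) {
        case '\\': dst[out++] = '\\'; break; /* \\ becomes one backslash */
        case 'n':  dst[out++] = '\n'; break; /* \n becomes a real newline */
        case '\r':
        case '\n':
            /* line folding: drop the line break and any WS after it */
            while (i < end && is_ws(src[i]))
                i++;
            break;
        default:                             /* not in the grammar: keep as is */
            dst[out++] = '\\';
            dst[out++] = c;
            break;
        }
    }
    dst[out] = '\0';
    return out;
}
```

The STRING rule's action would pass the matched text to this function
and use the result as the token text. How the text is read and set
depends on the target runtime, so that part is left out here.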

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Gavin Lambert
Sent: Saturday, March 31, 2007 4:26 PM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Manipulating lexer text output

Ok, next question :)

Is there some way for a lexer rule to manipulate the output text 
of the lexer token, when it's not the rule responsible for 
generating that token?  (I'm using the C language target, if that 
makes a difference.)

For example, imagine this grammar fragment:

fragment
EscapeSequence
   : '\\'
     (  '\\'
     |  'n'
     |  ('\r' | '\n') WS?
     )
   ;
STRING
   : '"' (~('"' | '\\') | EscapeSequence)* '"'
   ;

This works as is, but the result is identical to the source text,
including all escape sequences and quotes.  What I'd like to have
instead is the semantic equivalent -- i.e. output a STRING token
where the quotes are removed and the escape sequences have been
resolved: \\ is converted to a single backslash, \n to a real
newline character, and the final alt's text is removed entirely
(that's a line-folding escape).  This means that parsing only has
to be done once, instead of having to reparse the token text
outside of ANTLR.

Rewrite rules sound like the sort of thing that would help here,
but they don't seem to work in the lexer.  And I tried calling
emitNew from the subrule, but that resulted in replacing the
entire string, not just the substring matched by the subrule.
Any ideas?