[antlr-interest] Manipulating lexer text output

Gavin Lambert antlr at mirality.co.nz
Sat Mar 31 16:26:16 PDT 2007


Ok, next question :)

Is there some way for a lexer rule to manipulate the output text 
of the lexer token, when it's not the rule responsible for 
generating that token?  (I'm using the C language target, if that 
makes a difference.)

For example, imagine this grammar fragment:

fragment
EscapeSequence
   : '\\'
     (  '\\'
     |  'n'
     |  ('\r' | '\n') WS?
     )
   ;
STRING
   : '"' (~('"' | '\\') | EscapeSequence)* '"'
   ;

This works as is, but the token text is identical to the source 
text, including all escape sequences and quotes.  What I'd like 
instead is the semantic equivalent, i.e. a STRING token whose text 
has the quotes removed and the escape sequences resolved: \\ becomes 
a single backslash, \n becomes a real newline character, and the 
text of the final alt is removed entirely (that's a line-folding 
escape).  That way the string only has to be parsed once, instead 
of having to reparse the token text outside of ANTLR.
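
To make the intended result concrete, here is a minimal sketch in C of the transformation itself, written as a plain post-processing function. It does not solve the lexer-side question and uses nothing from the ANTLR C runtime; the names unescape_string and is_ws are hypothetical, and since WS is not shown in the fragment above, the sketch assumes it covers spaces, tabs and line-break characters.

```c
#include <stddef.h>

/* Hypothetical stand-in for the grammar's WS rule (an assumption:
 * spaces, tabs, carriage returns and line feeds). */
static int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/*
 * Resolve the raw text of a STRING token (quotes included) into dst.
 * The result is never longer than the input, so dst needs at least
 * len bytes.  Returns the number of bytes written, not counting the
 * terminating NUL, or (size_t)-1 if the text does not match the
 * STRING rule.
 */
size_t unescape_string(const char *src, size_t len, char *dst)
{
    size_t i = 1;   /* skip the opening quote */
    size_t out = 0;

    if (len < 2 || src[0] != '"' || src[len - 1] != '"')
        return (size_t)-1;

    while (i < len - 1) {           /* stop before the closing quote */
        char c = src[i++];

        if (c != '\\') {            /* ordinary character: copy it */
            dst[out++] = c;
            continue;
        }
        if (i >= len - 1)           /* backslash with nothing after it */
            return (size_t)-1;

        c = src[i++];
        switch (c) {
        case '\\':                  /* \\ -> single backslash */
            dst[out++] = '\\';
            break;
        case 'n':                   /* \n -> real newline */
            dst[out++] = '\n';
            break;
        case '\r':
        case '\n':                  /* line folding: drop the break and
                                       any whitespace that follows */
            while (i < len - 1 && is_ws(src[i]))
                i++;
            break;
        default:                    /* escape not in EscapeSequence */
            return (size_t)-1;
        }
    }

    dst[out] = '\0';
    return out;
}
```

A function like this would be run over the token text after lexing, which is exactly the second pass I'd like to avoid; it is only meant to pin down the expected output for each escape.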

Rewrite rules sound like the sort of thing that would help here, 
but they don't seem to work in the lexer.  I also tried calling 
emitNew from the subrule, but that replaced the entire string, not 
just the substring matched by the subrule.  Any ideas?


