[antlr-interest] Manipulating lexer text output

Sun Apr 1 13:51:28 PDT 2007

I think it's in the FAQ:

http://www.antlr.org/wiki/pages/viewpage.action?pageId=1461

Ter
On Mar 31, 2007, at 4:26 PM, Gavin Lambert wrote:

> Ok, next question :)
>
> Is there some way for a lexer rule to manipulate the output text of  
> the lexer token, when it's not the rule responsible for generating  
> that token?  (I'm using the C language target, if that makes a  
> difference.)
>
> For example, imagine this grammar fragment:
>
> fragment
> EscapeSequence
>   : '\\'
>     (  '\\'
>     |  'n'
>     |  ('\r' | '\n') WS?
>     )
>   ;
> STRING
>   : '"' (~('"' | '\\') | EscapeSequence)* '"'
>   ;
>
> This works as is, but the result is identical to the source text,
> including all escape sequences and quotes.  What I'd like to have
> instead is the semantic equivalent -- i.e. output a STRING token
> where the quotes are removed and the escape sequences have been
> resolved: \\ is converted to a single backslash, \n to a real
> newline character, and the final alt's text is removed entirely
> (that's a line-folding escape).  This means that parsing only has
> to be done once, instead of having to reparse the token text
> outside of ANTLR.
>
> Rewrite rules sound like the sort of thing that would help here,
> but they don't seem to work in the lexer.  And I tried calling
> emitNew from the subrule, but that replaced the entire string,
> not just the substring matched by the subrule.  Any ideas?
>
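
For reference, the transformation described in the question can be sketched in plain C. This is only an illustration of the decoding step, not ANTLR runtime code: unescape_string and is_ws are hypothetical names, and the sketch assumes WS means any run of spaces, tabs, CR, or LF (the fragment above doesn't define it). How the decoded text gets attached to the STRING token depends on the target runtime and isn't shown here.

```c
#include <stddef.h>

/* Hypothetical helper: assumes WS is space, tab, CR, or LF. */
static int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/*
 * Hypothetical helper, not part of the ANTLR C runtime.
 * Decodes the raw text of a STRING token (including both quotes)
 * into out as a NUL-terminated string:
 *   \\            -> one backslash
 *   \n            -> newline
 *   \<CR or LF>   -> removed, along with any following whitespace
 * Returns the decoded length, or (size_t)-1 if the input is
 * malformed or out is too small.
 */
static size_t unescape_string(const char *raw, size_t raw_len,
                              char *out, size_t out_cap)
{
    size_t i, n = 0;

    if (out_cap == 0 || raw_len < 2 ||
        raw[0] != '"' || raw[raw_len - 1] != '"')
        return (size_t)-1;

    /* Walk the characters between the two quotes. */
    for (i = 1; i + 1 < raw_len; i++) {
        char c = raw[i];

        if (c == '\\') {
            i++;
            if (i >= raw_len - 1)
                return (size_t)-1;          /* backslash with nothing after it */
            c = raw[i];
            if (c == 'n') {
                c = '\n';
            } else if (c == '\r' || c == '\n') {
                /* Line folding: drop the line break and trailing WS. */
                while (i + 1 < raw_len - 1 && is_ws(raw[i + 1]))
                    i++;
                continue;
            } else if (c != '\\') {
                return (size_t)-1;          /* escape not in the grammar */
            }
        }

        if (n + 1 >= out_cap)
            return (size_t)-1;              /* keep room for the NUL */
        out[n++] = c;
    }

    out[n] = '\0';
    return n;
}
```

Since the decoded text is never longer than the raw text, a buffer the size of the raw token is always large enough. Doing this in one pass over the token text is a post-processing step rather than something the lexer does per subrule, so it doesn't avoid the second scan the question is trying to eliminate; it just keeps that scan small and in one place.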