[antlr-interest] Manipulating text in the lexer

Thu Feb 26 07:48:57 PST 2009

Hey again all,

So, having returned to ANTLR (as previously mentioned), I've been trying 
to do things that used to be possible but no longer appear to be. 
http://www.antlr.org/blog/antlr3/lexical.tml suggests that it's no 
longer possible to alter the text of a token so that it differs from 
what's on the input. When crafting an ASN.1 grammar, this is rather a 
pain. As well as the obvious matter of wanting to strip the '"' from 
each end of a string literal, ASN.1 string literals have an odd 
requirement on the handling of whitespace and newlines within them, 
hopefully illustrated by these grammar fragments:

fragment
CSTRINGNL : WSNONL* NL WSNONL* {setText("");};

CSTRING : '"' ((CSTRINGNL)=> CSTRINGNL | '""' | ~'"')* '"';

WS : (WSNONL | NL) {$channel=HIDDEN;};

fragment
NL : ('\n' | '\r' | '\u000B' | '\f'); // vertical tab written as a Unicode escape

fragment
WSNONL : (' ' | '\t');

Ideally, I'd also want to turn each '""' found inside a string literal 
into a single '"' before passing the token on to the parser, as there's 
no need whatsoever to hold onto the doubled form. However, it's a 
*requirement* to discard newlines, along with any other whitespace 
immediately preceding or succeeding each one. It'd be really frustrating 
to have to do that at a later stage in processing.
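
To make the intended result concrete, here is a minimal sketch in plain 
Java of the text I'd like the CSTRING token to end up carrying. The class 
and method names are made up for illustration and are not part of the 
ANTLR runtime; it assumes it is handed the whole matched literal, 
including both surrounding quotes.

```java
// Hypothetical helper for illustration only; not an ANTLR API.
final class CStringNormalizer {
    private CStringNormalizer() {}

    /** Expects the raw matched literal, including both surrounding quotes. */
    static String normalize(String raw) {
        // Strip the opening and closing '"'.
        String body = raw.substring(1, raw.length() - 1);
        // Drop each run of newline characters together with the spaces
        // and tabs immediately before and after it.
        body = body.replaceAll("[ \\t]*[\\n\\r\\u000B\\f]+[ \\t]*", "");
        // Collapse each doubled quote into a single quote.
        return body.replace("\"\"", "\"");
    }
}
```

Whether an action on CSTRING can still hand a string like this back as 
the token's text is exactly the part I'm unsure of.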

So, can anyone clarify this for me, or let me know of some sort of 
workaround?

Thanks,

Sam Barnett-Cormack