[antlr-interest] String lexing and partial tokens

Mon Nov 27 23:11:31 PST 2006

At 16:14 27/11/2006, Jim Idle wrote:
 >The lexer emits a token automatically if you have not emitted one,
 >but if you use (C output) emitNew() in an action then it will use
 >this as the token. So, to exclude the start and end character:
 >
 >STRING: '"' (~'"')* '"'
 >	{
 >		emitNew(type,line,charPosition,channel,start,getCharIndex()-1);
 >	}

The thing is that this takes a lot more parameters than I really 
want to deal with in a grammar.  It violates my "this should be 
simple" rule :)

Though I agree that having it not go allocating strings is a good 
thing, so avoiding $setText seems like a good idea.

How about something more like what I ended up hacking out, with a 
bit of extra support code to make it more palatable?  Like so:

STRING: '"' content=UnquotedText '"' { emitPartial($content); };
fragment UnquotedText: (~'"')*;

Where 'emitPartial(x);' is the equivalent of 'emit(x); 
ltoken()->setType(ltoken(), the_token_type_being_generated);'

That should be fairly simple to implement.
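
To make the idea concrete, here is a minimal standalone sketch in C of what emitPartial amounts to: take the token already produced for the inner fragment, keep its span in the input, and stamp it with the type of the outer rule, so no string is ever copied. Everything here (Token, emit_partial, lex_string, the TOK_ constants) is hypothetical and invented for illustration; it is not the ANTLR C runtime API, and it assumes a NUL-terminated input buffer with pos inside it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token: a view into the input buffer, never a copied string. */
typedef struct {
    int    type;
    size_t start;   /* index of the first character of the token text */
    size_t length;  /* number of characters in the token text         */
} Token;

enum { TOK_UNQUOTED_TEXT = 1, TOK_STRING = 2 };

/* Hypothetical stand-in for emitPartial(x): reuse the inner token's
 * span unchanged, but give it the type of the rule being generated. */
Token emit_partial(Token inner, int outer_type) {
    Token out = inner;
    out.type = outer_type;
    return out;
}

/* Hand-written equivalent of
 *   STRING: '"' content=UnquotedText '"' { emitPartial($content); };
 * Returns 1 and fills *out on a match at input[pos], 0 otherwise. */
int lex_string(const char *input, size_t pos, Token *out) {
    assert(input != NULL && out != NULL);
    if (input[pos] != '"') {
        return 0;                       /* no opening quote */
    }
    size_t begin = pos + 1;
    size_t end = begin;
    while (input[end] != '\0' && input[end] != '"') {
        end++;                          /* UnquotedText: (~'"')* */
    }
    if (input[end] != '"') {
        return 0;                       /* unterminated string */
    }
    Token inner = { TOK_UNQUOTED_TEXT, begin, end - begin };
    *out = emit_partial(inner, TOK_STRING);
    return 1;
}
```

Calling lex_string on the opening quote of "hello" would yield a TOK_STRING token whose span covers only hello, which is the behaviour the grammar action above is after.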

It'd be better still if the fragment weren't required, and you 
could write something like this (this generates an AST parse error 
from ANTLR at the moment):

STRING: '"' content=(~'"')* '"' { emitPartial($content); };

(maybe you'd have to have an extra set of parentheses around 
there; not sure.)

And the ultimate extension would then be to reintroduce the ! 
operator, which would automatically do the above if all the non-! 
components of the rule form a contiguous block.  If they're 
non-contiguous, it would still be an error, since you can't 
generate a single substring from the incoming char stream that way.
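
The contiguity condition can be sketched as a small check, again in standalone C. The Component struct and kept_span function are hypothetical names made up for this illustration, and the sketch assumes the components are listed in input order with each one starting where the previous one ends. It treats contiguity structurally: once a dropped (!) component follows a kept one, any later kept component is an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one matched component of a lexer rule. */
typedef struct {
    size_t start;   /* index of the component's first character */
    size_t length;  /* number of characters it matched          */
    int    kept;    /* 0 if the component was suffixed with '!' */
} Component;

/* Returns 1 and fills *start / *length with the single substring covered
 * by the kept components if they form one contiguous block.
 * Returns 0 if no component is kept or if the kept ones are split by a
 * dropped component. */
int kept_span(const Component *parts, size_t n,
              size_t *start, size_t *length) {
    assert(parts != NULL || n == 0);
    /* 0: before the kept block, 1: inside it, 2: after it */
    int state = 0;
    size_t first = 0, total = 0;

    for (size_t i = 0; i < n; i++) {
        if (parts[i].kept) {
            if (state == 2) {
                return 0;               /* second kept block: not contiguous */
            }
            if (state == 0) {
                first = parts[i].start; /* kept block begins here */
                state = 1;
            }
            total += parts[i].length;
        } else if (state == 1) {
            state = 2;                  /* kept block has ended */
        }
    }
    if (state == 0) {
        return 0;                       /* nothing kept at all */
    }
    *start = first;
    *length = total;
    return 1;
}
```

For a rule shaped like '"'! (~'"')* '"'! the check succeeds and yields the span of the middle component; for a rule that keeps the two quotes but drops the middle, it fails.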