[antlr-interest] String lexing and partial tokens
Gavin Lambert
antlr at mirality.co.nz
Mon Nov 27 23:11:31 PST 2006
At 16:14 27/11/2006, Jim Idle wrote:
>The lexer emits a token automatically if you have not emitted one,
>but if you use (C output) emitNew() in an action then it will use
>this as the token. So, to exclude the start and end character:
>
>STRING: '"' (~'"')* '"'
> {
>   emitNew(type,line,charPosition,channel,start,getCharIndex()-1);
> }
The thing is that this is a lot more parameters than I really want
to deal with in a grammar. It violates my "this should be simple"
rule :)

Though I agree that having it not go allocating strings is a good
thing, so avoiding $setText seems like a good idea.

How about something more like what I ended up hacking out, with a
bit of extra support code to make it more palatable? Like so:
STRING: '"' content=UnquotedText '"' { emitPartial($content); };
fragment UnquotedText: (~'"')*;
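Here 'emitPartial(x);' would be the equivalent of 'emit(x);
ltoken()->setType(ltoken(), the_token_type_being_generated);'
That is, it emits the sub-token x as the rule's token and then
retypes it as the rule being matched. That should be fairly simple
to implement.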
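
To make the idea concrete, here is a minimal standalone sketch in C.
It does not use the ANTLR C runtime: the Token struct, the
emit_partial() helper and the hand-written lex_string() matcher are
all hypothetical stand-ins, assuming only that a token can be
described as a type plus a start/stop span over the input buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token: a span over the input buffer, so no string is copied. */
typedef struct {
    int    type;
    size_t start;  /* index of the first character, inclusive */
    size_t stop;   /* index one past the last character */
} Token;

/* Hypothetical token types for this sketch only. */
enum { TOK_UNQUOTED_TEXT = 1, TOK_STRING = 2 };

/* Reuse the sub-token's span, but give it the enclosing rule's type. */
static Token emit_partial(Token content, int rule_type)
{
    Token out = content;   /* same start/stop as the fragment */
    out.type = rule_type;  /* retype as the rule being matched */
    return out;
}

/* Hand-written stand-in for: STRING: '"' content=UnquotedText '"'
 * Returns 1 and fills *out on a match, 0 otherwise. */
static int lex_string(const char *input, size_t pos, Token *out)
{
    assert(input != NULL && out != NULL);
    if (input[pos] != '"')
        return 0;

    size_t start = pos + 1;
    size_t i = start;
    while (input[i] != '\0' && input[i] != '"')
        i++;                       /* UnquotedText: (~'"')* */
    if (input[i] != '"')
        return 0;                  /* unterminated string */

    Token content = { TOK_UNQUOTED_TEXT, start, i };
    *out = emit_partial(content, TOK_STRING);
    return 1;
}
```

The emitted token has type TOK_STRING but covers only the characters
between the quotes, which is the effect the grammar action above is
after.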
It'd be better still if the fragment weren't required, and you
could write something like this (this generates an AST parse error
from ANTLR at the moment):
STRING: '"' content=(~'"')* '"' { emitPartial($content); };
(maybe you'd have to have an extra set of parentheses around
there; not sure.)
And the ultimate extension would then be to reintroduce the !
operator, which would do the above automatically whenever all the
non-! components of the rule form a contiguous block. If they're
non-contiguous, it'd still be an error, since you can't generate a
single substring from the incoming char stream that way.
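
The contiguity check itself could look roughly like this. Again a
sketch in plain C, not anything from ANTLR: kept_run() and the
"kept" flag array are hypothetical, with kept[i] meaning that element
i of the rule carries no ! suffix. Treating "nothing kept" as an
error is my own assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Find the single run of kept (non-!) elements in a rule.
 * On success, stores the first and last kept indices and returns true.
 * Returns false if the kept elements are non-contiguous, or (by
 * assumption in this sketch) if no element is kept at all. */
static bool kept_run(const bool *kept, size_t n, size_t *first, size_t *last)
{
    size_t i = 0;

    while (i < n && !kept[i])
        i++;                 /* skip leading !-suffixed elements */
    if (i == n)
        return false;        /* nothing kept */

    *first = i;
    while (i < n && kept[i])
        i++;                 /* walk the contiguous kept block */
    *last = i - 1;

    for (; i < n; i++)
        if (kept[i])
            return false;    /* a second kept block: not one substring */
    return true;
}
```

For a rule shaped like '"'! (~'"')* '"'! the flags would be
{false, true, false}, giving a single run at index 1; a rule shaped
like A B! C would be rejected.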