[antlr-interest] Manipulating lexer text output

Wed Apr 4 01:43:18 PDT 2007

At 09:37 4/04/2007, Jim Idle wrote:
 >Well, perhaps, but the lexer is just looking at one character at a
 >time anyway and does not produce a string with that stuff in it
 >unless you ask for it - if you don't then you just get a token that
 >says where the start and stop is. So it isn't overly inefficient and
 >if you set the text explicitly then there is probably little
 >difference in execution time.

Imagine the following input stream, though:

   "test\tone\u00B2\r\nline \
      two"

The result, of course, should be:
   (String test<TAB>one²<CR><LF>line two)

All the bits starting with a \ are escape 
sequences and are handled by the EscapeSequence 
rule -- the String rule itself doesn't know how 
they are structured.  At present, however, the
EscapeSequence rule has no means of communicating
the "real" value of the text it has just matched
back out to the String rule.  It doesn't
generate a token, because that's optimised out 
for subrules (and if it generates one anyway then 
it replaces the entire String instead of just 
that subrule's portion of the input); the return 
type is fixed, so it can't pass additional data 
back -- there's simply nothing it can do.

What I think should happen is that if the subrule 
*does* generate a token (which will only be 
because someone explicitly wrote some code to do 
so [or put in a rewrite rule or ! operator, if 
those start working in the lexer] -- it won't be 
the common case) then the parent rule should 
detect that and stitch the result into its own customised token.

To explain that a bit more, here are a couple of
example cases (to keep things simple, I'm going
to assume the top-level rule doesn't want to
strip off the surrounding quotes):

1. The simple case, where the input is something
like "test" -- i.e. no escape sequences.  The
String rule calls the Character rule (or just 
matches characters directly), but either way, 
nothing actually generates a token, so the String 
rule doesn't need to do any extra work and 
behaves exactly as it does now -- returning a 
token with no explicit text and just start & stop 
indexes into the input stream.

2. The complex case, with input as above.  In 
this case, String should start out as before, 
tracking characters but not doing anything much 
about it.  When it calls EscapeSequence to match 
the "\t", though, that rule generates a token 
starting at the \ and stopping at the t, but with 
explicit text set to a tab character.  When 
control returns to String, it detects that a 
token was generated and then immediately 
generates one of its own (starting at the quote 
and stopping at the 't' of the tab escape in the 
input), also with explicit text, containing the 
quote, the word "test", and the tab character 
(from the subtoken).  Then it should proceed 
through the rest of the input it would normally 
match, either appending directly (and extending 
the stop point) or appending the text from the 
subtoken (or the text between the start and stop 
point for the subrule if it doesn't generate a 
token of its own).  Alternatively, if it's 
simpler or more performant, the String rule could 
just keep track of all the components and stitch 
everything together only at the end (this might
be better if failure and backtracking are fairly
common).  Either way, if any subrule generates
explicit text then the parent rule will have to 
do so as well.  And in this example, it should 
finally generate a single token containing the 
explicit text shown above (though with the quotes too).
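
As a rough illustration of the stitching described in case 2, here is a
sketch in Python rather than in ANTLR's generated code.  Everything in it
is hypothetical and only models the proposal: the Token class, the
escape_sequence and string_rule functions, the set of supported escapes,
and the assumption that a backslash at end of line also swallows the
indentation on the next line.  It is not how ANTLR behaves today.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    start: int                  # index of the first matched character
    stop: int                   # index of the last matched character (inclusive)
    text: Optional[str] = None  # explicit text; None means "slice the input"

    def get_text(self, source: str) -> str:
        if self.text is not None:
            return self.text
        return source[self.start:self.stop + 1]


# Assumed escape set, chosen for illustration only.
SIMPLE_ESCAPES = {"t": "\t", "r": "\r", "n": "\n", "\\": "\\", '"': '"'}


def escape_sequence(source: str, pos: int) -> Token:
    """Subrule: match one escape at source[pos] == '\\' and set explicit text."""
    kind = source[pos + 1]
    if kind == "u":
        # \uXXXX: four hex digits follow the 'u'.
        return Token(pos, pos + 5, chr(int(source[pos + 2:pos + 6], 16)))
    if kind == "\n":
        # Assumed line continuation: drop the newline and following indentation.
        stop = pos + 1
        while stop + 1 < len(source) and source[stop + 1] in " \t":
            stop += 1
        return Token(pos, stop, "")
    return Token(pos, pos + 1, SIMPLE_ESCAPES[kind])


def string_rule(source: str, start: int) -> Token:
    """Parent rule: match a quoted string beginning at source[start] == '"'."""
    pos = start + 1
    parts: Optional[List[str]] = None  # None until a subrule yields explicit text
    while source[pos] != '"':
        if source[pos] == "\\":
            sub = escape_sequence(source, pos)
            if parts is None:
                # First subtoken seen: capture everything matched so far.
                parts = [source[start:pos]]
            parts.append(sub.text)
            pos = sub.stop + 1
        else:
            if parts is not None:
                parts.append(source[pos])
            pos += 1
    if parts is None:
        # Simple case: no explicit text, just start and stop indexes.
        return Token(start, pos)
    parts.append('"')
    return Token(start, pos, "".join(parts))


source = '"a\\tb"'
token = string_rule(source, 0)
print(repr(token.get_text(source)))
```

This takes the "stitch as you go" route: the parent only starts building
explicit text once a subrule hands back a token, so a string without
escapes costs nothing extra.  The sketch does no error handling (an
unterminated string or unknown escape simply raises), and collecting the
parts in a list would also suit the "stitch at the end" variant.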

I don't think this would be too hard to 
implement, and it should still remain performant 
(because it'd only be used when explicitly 
requested by the grammar), and it'd allow a lot more flexibility.