[antlr-interest] Lexer bug?

Tue Oct 23 05:31:01 PDT 2007

At 01:00 24/10/2007, Clifford Heath wrote:
 >And then there's the fact that the string above has
 >white-space embedded, which means that it potentially
 >interacts negatively with the whitespace handling...
 >or maybe not in this case.

Any lexer rule that permits embedded whitespace must specify it
explicitly, since the whitespace hiding/skipping rule is just
another lexer rule at the same "level" and is not applied inside
other tokens.  So that could complicate your rule a bit if you
wanted to handle it in there.  But if you're using Jim's rule
(with multi-token emitting added), all you might need to do is
specify that whitespace is allowed after the second '.' of the
'..' pair.

This is because if the input is "10 .. 30" you'll already get it 
as three separate tokens without doing any extra work.  If the 
input is "10..30" you'll need to handle it within the one rule 
(because of the dot recognition problem) -- but you can then emit 
the same three tokens as in the first case.  If the input is "10 
..30" the first number will be handled ok by itself, then you'll 
have to break apart the combined "..30" in a single rule and 
output two tokens (so again you end up with the same three tokens 
as in the first case).  If the input is "10.. 30" then you can 
either treat it like the second case (doing it all in one rule, by 
explicitly specifying the whitespace and outputting three tokens) 
or treat it like the third case (making a number with trailing .. 
output two tokens).
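
To make the four cases concrete, here is a rough hand-written sketch in Python (not ANTLR, and not Jim's actual rule).  The token names INT, REAL and RANGE and the function tokenize are made up for illustration.  It takes the "number with trailing .. outputs two tokens" approach, so all four spellings end up as the same three tokens:

```python
import re

# Hypothetical token names; the thread does not fix them.
INT, REAL, RANGE = "INT", "REAL", "RANGE"
_DIGITS = re.compile(r"\d+")


def tokenize(text):
    """Return (type, text) pairs; whitespace is skipped between tokens only."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text.startswith("..", i):
            # A '..' that was not glued to a preceding number.
            tokens.append((RANGE, ".."))
            i += 2
        elif text[i].isdigit():
            end = _DIGITS.match(text, i).end()
            if text.startswith("..", end):
                # "10..": the number rule emits two tokens, INT then RANGE,
                # instead of mistaking the first '.' for a decimal point.
                tokens.append((INT, text[i:end]))
                tokens.append((RANGE, ".."))
                i = end + 2
            elif text.startswith(".", end) and text[end + 1:end + 2].isdigit():
                # A single '.' followed by digits: an ordinary real.
                frac_end = _DIGITS.match(text, end + 1).end()
                tokens.append((REAL, text[i:frac_end]))
                i = frac_end
            else:
                tokens.append((INT, text[i:end]))
                i = end
        else:
            raise ValueError("unexpected character %r at %d" % (text[i], i))
    return tokens
```

Here whitespace after the '..' needs no special handling because the following number is lexed on its own; a rule that consumed all of "10.. 30" at once would have to allow for it explicitly.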

 >Still, I already dislike that I have to re-lex a NUMBER
 >to find whether it's octal, hex, integer or real.
 >I already paid a lexer to do that for me, so why am I
 >doing it again?

I don't know -- why are you?  There's certainly no need to -- just 
output different tokens in each case and then make a parser rule 
that accepts any of them when you're in a context that doesn't 
care what kind of numeric literal is provided.

(This is actually easier to do with a rule similar to what Jim 
proposed, since each path through the rule is more explicitly 
spelled out.)
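
As a rough illustration of "different tokens in each case", here is a small Python sketch (again not ANTLR).  The token names HEX, OCTAL, INT and REAL, the literal syntax, and the helper names are all assumptions made for the example.  The lexer decides the kind once; a parser-side check then accepts any of them where the kind doesn't matter:

```python
import re

# Hypothetical literal syntax; each alternative yields its own token type.
_NUMBER = re.compile(
    r"""
      (?P<HEX>0[xX][0-9a-fA-F]+)
    | (?P<REAL>\d+\.\d+)
    | (?P<OCTAL>0[0-7]+)
    | (?P<INT>\d+)
    """,
    re.VERBOSE,
)

# Plays the role of a parser rule along the lines of
#   number : HEX | OCTAL | INT | REAL ;
NUMERIC_KINDS = frozenset({"HEX", "OCTAL", "INT", "REAL"})


def lex_number(text):
    """Classify a numeric literal once and return a (type, text) token."""
    m = _NUMBER.fullmatch(text)
    if m is None:
        raise ValueError("not a numeric literal: %r" % text)
    # lastgroup is the name of the alternative that matched.
    return (m.lastgroup, text)


def is_number(token):
    """True for any numeric token, whatever its kind."""
    return token[0] in NUMERIC_KINDS
```

Code that cares about the kind looks at the token type directly; code that doesn't just calls is_number, and nobody re-lexes the text.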