[antlr-interest] Lexer bug?

Clifford Heath clifford.heath at gmail.com
Mon Oct 22 17:47:05 PDT 2007


Jim Idle wrote:
>  > Jim Idle wrote:
>  > > This isn't a bug.
>  > Nonsense. Any lexer that consumes characters that aren't a legal token,
>  > and announces a legal token without flagging an error, has a bug.
> It wasn't my intention to offend and elicit an emphatic "nonsense" 
> response. However I should point out that the accusation is of course 
> erroneous. The lexer produces code that is in line with the original 
> design.

First up, let me say that I'm sorry my post was thought uncivil. I do
appreciate the helpful discussion and workarounds offered, and I don't
mean to disparage anyone.

However, I still maintain that the job of a lexer is to divide the input
into tokens, without discarding any of it. If it's unable to do that, it
must report an error. Otherwise, every token it announces must be correctly
matched. There is no middle path, and any design that allows one is faulty,
even if the code implements the design perfectly. Such principles are
black-and-white, and that's why I used the word "nonsense".
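
To make the principle concrete, here is a minimal sketch in Python rather
than ANTLR. The token vocabulary (INT, RANGE, WS), the LexError class and the
tokenize function are all hypothetical names chosen for illustration; this is
not the grammar under discussion, nor how ANTLR generates its lexers. The
point is only that every character either ends up inside a token or triggers
an error.

```python
import re

# Hypothetical token vocabulary, for illustration only.
TOKEN_RE = re.compile(r"(?P<INT>\d+)|(?P<RANGE>\.\.)|(?P<WS>\s+)")


class LexError(Exception):
    """Raised when no token can be matched at the current position."""


def tokenize(text):
    """Split text into (kind, lexeme) pairs without dropping any character."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # No legal token starts here: report it instead of skipping ahead.
            raise LexError(f"illegal character {text[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("1..3"))  # [('INT', '1'), ('RANGE', '..'), ('INT', '3')]
```

Joining the lexemes back together reproduces the input exactly, and an input
such as "1.3" raises LexError at the lone "." rather than silently yielding
a token.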

With that out of the way, it's perfectly valid to look for pragmatic ways
to rewrite the rules to avoid the (design) bug, and I thank you for offering
suggestions.

With regard to the suggestions offered, I'm not sure I understand all of
them, and if I do, I'm not sure I want to implement that way. For example,
it seemed that one suggestion would have it that I should recognize the
string "0.12 ..  3.5" as a single token... and I'm *sure* I don't want to
do that!

Loring's suggestion seems closest to the mark. I've just rewritten the
rule as follows (with decimal, octal and hexadecimal literals rolled in),
and it works. It does seem bizarre that I can't separate integers from
reals in the lexer, but I'll live with that:

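NUMBER
:	'0'				// Zero
|	'0' ('0'..'7')+		// An octal integer
|	'0' 'x' HEXDIGIT+	// A hexadecimal integer
|	SIGN? '1'..'9' DIGIT*	// A decimal integer
|	SIGN? DIGIT+ FRACTION	// A real number with a fraction
|	SIGN? DIGIT+ EXPONENT	// A real number with an exponent only
|	SIGN? DIGIT+ FRACTION EXPONENT	// A real number with fraction and exponent
|	SIGN? FRACTION EXPONENT?	// A real number with no integer part
;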

This even parses "0...1" correctly... though I'd cane anyone who wrote that!
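
For readers who want to experiment with the NUMBER rule outside ANTLR, the
following is a rough Python approximation. It rests on assumptions: the post
does not define DIGIT, HEXDIGIT, SIGN, FRACTION or EXPONENT, so the
definitions below (notably FRACTION as '.' DIGIT+ and EXPONENT as
[eE] SIGN? DIGIT+) are guesses, and the RANGE and WS tokens and the
lex_numbers function are hypothetical. It picks the longest matching
alternative at each position, which is not necessarily how the generated
ANTLR lexer makes its decisions.

```python
import re

# Assumed fragment definitions; the original post does not show them.
DIGIT = r"[0-9]"
HEXDIGIT = r"[0-9a-fA-F]"
SIGN = r"[+-]"
FRACTION = rf"\.{DIGIT}+"
EXPONENT = rf"[eE]{SIGN}?{DIGIT}+"

# One pattern per alternative of the NUMBER rule, in the same order.
NUMBER_ALTS = [re.compile(p) for p in (
    r"0",
    r"0[0-7]+",
    rf"0x{HEXDIGIT}+",
    rf"{SIGN}?[1-9]{DIGIT}*",
    rf"{SIGN}?{DIGIT}+{FRACTION}",
    rf"{SIGN}?{DIGIT}+{EXPONENT}",
    rf"{SIGN}?{DIGIT}+{FRACTION}{EXPONENT}",
    rf"{SIGN}?{FRACTION}(?:{EXPONENT})?",
)]
RANGE = re.compile(r"\.\.")  # hypothetical range operator token
WS = re.compile(r"\s+")      # whitespace is recognised but not emitted


def match_number(text, pos):
    """Return the end offset of the longest NUMBER match at pos, or None."""
    ends = [m.end() for m in (alt.match(text, pos) for alt in NUMBER_ALTS) if m]
    return max(ends) if ends else None


def lex_numbers(text):
    """Return (kind, lexeme) pairs; raise ValueError on unmatched input."""
    tokens, pos = [], 0
    while pos < len(text):
        kind, end = "NUMBER", match_number(text, pos)
        if end is None:
            for kind, pattern in (("RANGE", RANGE), ("WS", WS)):
                m = pattern.match(text, pos)
                if m:
                    end = m.end()
                    break
        if end is None:
            raise ValueError(f"no token matches at offset {pos}: {text[pos]!r}")
        if kind != "WS":
            tokens.append((kind, text[pos:end]))
        pos = end
    return tokens


print(lex_numbers("0...1"))
# [('NUMBER', '0'), ('RANGE', '..'), ('NUMBER', '.1')]
print(lex_numbers("0.12 ..  3.5"))
# [('NUMBER', '0.12'), ('RANGE', '..'), ('NUMBER', '3.5')]
```

Under these assumptions "0.12 ..  3.5" comes out as three tokens rather than
one, and "0...1" splits into 0, "..", and .1.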

It's very odd to me that this behaves so differently, but there you go -
it does! Thanks to all for the help.

Clifford Heath.


