[antlr-interest] Lexer bug?

Mon Oct 22 19:01:14 PDT 2007

On 10/22/07, Clifford Heath <clifford.heath at gmail.com> wrote:
> Jim Idle wrote:
> >  > Jim Idle wrote:
> >  > > This isn't a bug.
> >  > Nonsense. Any lexer that consumes characters that aren't a legal token,
> >  > and announces a legal token without flagging an error, has a bug.
> > It wasn't my intention to offend and elicit an emphatic "nonsense"
> > response. However I should point out that the accusation is of course
> > erroneous. The lexer produces code that is in line with the original
> > design.
>
> First up, let me say that I'm sorry my post was thought uncivil. I do
> appreciate the helpful discussion and workarounds offered, and I don't
> mean to disparage anyone.
>
> However, I still maintain that the job of a lexer is to divide the input
> into tokens, without discarding any. If it's unable to do that, it must
> report an error. If not, then the tokens must be correctly matched. There
> is no middle path, and any design that allows one is faulty, even if the
> code implements the design perfectly. Such principles are black-and-white,
> and that's why I used the word "nonsense".
>

I can see how a person could perceive your "black and white"
statement as too strongly worded.  You can have strong opinions, but
that one is stated as an absolute fact.  I think the design of Antlr
worked fairly hard to not follow a principle you consider absolute.  I
happen to agree with you, but that's beside the point.

I think you just disagreed with a fundamental decision of Antlr3
(Antlr2 might also do it, but I don't know)... I mean Antlr3 works
fairly hard to recover by skipping a single token and proceeding.

In interactive use, that seems like it'd be really nice, but during
a batch run (like compiling source), I really dislike it.  It'd
be nice if I had some opportunity to programmatically control that.
I'd also say that fundamentally, I'd really like it if Antlr did some
of these:

1. Warned me at generation time that my grammar has an LL(1) case
where the error recovery might do something counter-intuitive.  That
way I'd at least know something was off before discovering it while
testing my grammar.

2. Gave me the programmatic ability to disable the LL(1) recovery at
generation and/or run time, preferably run time (or the ability to
generate two different parsers for the same grammar, one with error
recovery, the other without).

3. Never used LL(1) recovery until it had exhaustively searched for
other solutions.
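
To make item 2 concrete, here is a rough sketch of the kind of
run-time switch I have in mind.  It is a toy, not Antlr code: the
class StrictModeSketch, the match method, and the strict flag are all
names I made up for illustration, and the "recovery" is only my loose
imitation of skipping a single token.  With strict off, it drops one
unexpected token if the token after it is the expected one, and
records an error; with strict on, it refuses to guess and throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; none of these names come from Antlr. */
public class StrictModeSketch {

    /**
     * Matches tokens against a fixed expected sequence.
     * Returns the list of recovered errors (empty if the input was clean).
     * Trailing input after the expected sequence is not checked.
     */
    public static List<String> match(List<String> tokens,
                                     List<String> expected,
                                     boolean strict) {
        List<String> errors = new ArrayList<>();
        int pos = 0;
        for (String want : expected) {
            if (pos < tokens.size() && tokens.get(pos).equals(want)) {
                pos++;                       // normal case: consume the token
                continue;
            }
            String got = pos < tokens.size() ? tokens.get(pos) : "<EOF>";
            String msg = "expected " + want + " but found " + got
                       + " at index " + pos;
            if (strict) {
                // Strict mode: never guess, fail on the first mismatch.
                throw new IllegalStateException(msg);
            }
            errors.add(msg);
            // Lenient mode: if the token after the bad one is the one we
            // wanted, drop the bad token and consume the good one.
            if (pos + 1 < tokens.size() && tokens.get(pos + 1).equals(want)) {
                pos += 2;
            }
            // Otherwise carry on as if the expected token had been there.
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> expected = List.of("ID", "=", "INT", ";");
        List<String> input = List.of("ID", "=", "=", "INT", ";");

        // Lenient: the stray "=" is skipped and reported.
        System.out.println(match(input, expected, false));

        // Strict: the same input is rejected outright.
        try {
            match(input, expected, true);
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

The point is only that the choice between guessing and failing is a
single flag the caller controls, so a batch compile could run strict
while an interactive tool stays lenient.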

If I could figure out how to get Antlr building, I'd try to help.
Alas, the Ant scripts are failing me, and I haven't had time to fix
them (I think it's mostly that Antlr 2.7 isn't installed correctly for
Ant to pick it up).

Fundamentally, the automatic recovery feels like it can cause some of
the same problems that HTML and Web Browsers did forever.  Input that
is really close to what I want, but slightly wrong, leads to very
strange behavior because some tool is guessing what I meant instead
of saying "I'm sorry Dave, I'm afraid I can't do that."
I'd really like a way to put Antlr into a very, very strict mode.
Hacking around in the exception handling of both the parser and the
lexer is just inelegant.

Alas, I get nothing but silence so far.  Hopefully folks don't find my
e-mails too annoying.

Thanks,
    Kirby