[antlr-interest] Lexer bug?

Jim Idle jimi at temporal-wave.com
Mon Oct 22 09:29:44 PDT 2007



> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Clifford Heath
> Sent: Sunday, October 21, 2007 5:49 PM
> To: antlr-interest at antlr.org
> Cc: public-antlr-interest-ErkRXerACLvYtjvyW6yDsg at ciao.gmane.org
> Subject: Re: [antlr-interest] Lexer bug?
> 
> Jim Idle wrote:
> > This isn't a bug.
> 
> Nonsense. Any lexer that consumes characters that aren't a legal token,
> and announces a legal token without flagging an error, has a bug.

I want to be clear about what I am saying here, as it may be of benefit to others.


Basically, whether one's interpretation of the current behavior is that it
is incorrect or a bug (and I will let Ter comment on whether these are bugs
or not ;-), the fact remains that this is what currently happens if you try
to separate certain lexical decisions into separate token rules and then
expect the ANTLR analysis to work out what is required. I am of a practical
nature and generally work with what I have, so I write my lexers with the
current behavior in mind. The current behavior being what it is, I suggest
the more pragmatic approach detailed in previous posts, which is, where
possible, to combine common roots into one rule and split the lexing
decisions at the obvious points: where the common root(s) diverge. To me,
at least, this seems intuitive and natural, and it also allows the lexer to
recognize certain malformed tokens, which can result in clearer error
messages.
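
To make that concrete, here is a minimal sketch of the kind of structure I
mean, in ANTLR v3 syntax. The grammar name, the INT/FLOAT token names and
the trivial parser rule are purely illustrative - this is not the grammar
from this thread, just the shape of the approach: match the common root
once and reassign $type at the point of divergence.

grammar Example;                  // hypothetical name, not from this thread

tokens { INT; FLOAT; }            // token types handed out by NUMBER below

// Trivial parser rule, only so that the combined grammar is complete.
numbers : ( INT | FLOAT )* EOF ;

// The common root (the digits) is matched once; the INT/FLOAT decision
// is made only where the two tokens diverge (the '.').
NUMBER
    :   ('0'..'9')+               { $type = INT; }
        ( '.' ('0'..'9')+         { $type = FLOAT; } )?
    ;

WS  :   (' '|'\t'|'\r'|'\n')+     { skip(); } ;

The point is simply that the decision the lexer has to make is written down
exactly where it occurs.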

Basically, I treat ANTLR like a compiler that may well one day be able to
perform deeper analysis of my source and produce better target code, but
doesn't right now. The better the input I give it, in terms of what it can
deal with, the more efficient my generated code will be. The situation is
of course not quite analogous to optimizing a poor algorithm versus a good
one, but in approach, at least, that is how I look at it. Hence, it seems
to me that if I encode the recognition of one or more tokens that have
common roots as explicitly having those common roots, then I give the
analysis and code generation phases a better chance of producing better
code than they otherwise would (for the moment).
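
For contrast, the "separate tokens" shape I am talking about looks
something like this (again purely illustrative; whether a particular pair
of rules actually gives the analysis trouble depends on the grammar as a
whole). In place of the tokens{} section and the NUMBER rule above, one
would write:

// The same two tokens as disparate rules; the shared run of digits is
// now something the tool has to discover for itself by analyzing the
// two rules together.
INT   : ('0'..'9')+ ;
FLOAT : ('0'..'9')+ '.' ('0'..'9')+ ;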

Now, Ter has acknowledged that there is plenty of scope for optimizing the
generated code, and that to date this has not been a top priority beyond
making sure that the code is not weak in this regard. Hence, when you look
at the generated code for some things, you can see that there is room for
improvement, but generally it is sound. This is of course more to do with
how the analysis works (or doesn't, if you believe this to be a bug :-).

So, perhaps this is a good point at which to separate the points that Gavin
and Loring make, which seem reasonable, from the "this is what works, and
you won't regret this structure anyway" angle, which is what I am getting
at.

It seems to me that, whatever ANTLR does or does not do with the constructs
embodied by the examples in this thread, pooling the descriptions into one
rule results in a much more readable source grammar, where you can easily
visualize the paths the lexer must take to resolve the tokens. Hence, even
should Ter change the current analysis (and he may, I don't know what the
crazy bugger will do next ;-)), I should probably still instinctively code
my lexer rules like this. When I return to a grammar 12 months later, it
will still be obvious what the lexer is doing, whereas with disparate rules
I will have to inspect the code manually until I remember. While predicates
and so on are not 'wrong', or they would not be available at all, it seems
that if I have the option not to use a predicate, by using a different
structure, then I ought to take it.
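
To illustrate what I mean by visualizing the paths, here is one more
hypothetical sketch, this time for a family of operators with a common
root (the token names are again assumed to be declared in a tokens{}
section, as in the earlier example). Every branch point is visible in one
place and no predicate is needed:

// Assumes: tokens { LT; LE; SHL; SHL_ASSIGN; }
LT_OP
    :   '<'                       { $type = LT; }
        (   '='                   { $type = LE; }
        |   '<'                   { $type = SHL; }
            ( '='                 { $type = SHL_ASSIGN; } )?
        )?
    ;

Reading that rule a year from now, the possible tokens and the points where
they part company are still obvious at a glance.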

I think (and correct me, Loring, if I am wrong) that Loring is saying that
the code produced by my example should also be the structure produced when
the tokens are specified separately, since in fact the same thing is being
asked for; he may well have a point. However, in purely practical terms, it
seems better to acknowledge how things are at the moment and use a
practical solution that involves no tricks or hacks, just specifying the
grammar in a way the tool is happy with. While the theory is interesting,
and if there is a 'bug' in the LL(*) implementation then it needs to be
explored, most of us are simply trying to use the tool to produce a product
of some sort.

It wasn't my intention to offend and elicit an emphatic "nonsense"
response. However, I should point out that the accusation is of course
erroneous: the lexer produces code that is in line with the original
design. If one wants it to work a different way, that is fine, and one is
free to make the point and argue sensibly for a change of heart, as both
Gavin and Loring do, for instance. But we should maintain civility if
possible.

So, in short, my post was there to help you get this particular lexer
working, and it will work if you use that approach. We can discuss how it
'should' work until the cows come home, but in the same way that DBMS xyz
'should' do 'abc' but doesn't, we take the pragmatic approach and work with
what we have, rather than hoping people will read Codd properly and
implement that, and so on. This will result in less frustration and more
success all around. I am sure that Ter did not produce this tool so that it
would frustrate the hell out of anyone - rather the opposite. It is,
however, a tricky problem, and if it were not, the perfect analysis and
code generator would already be in place, and it would have been done long
ago. I find ANTLR to be the best tool around at the moment; it may not be
so for every individual and every situation!

Anyway, I hope that everyone is successful in their endeavors and that we
can maintain the helpful attitude that generally prevails here!

Jim
