[antlr-interest] Lexer bug? (with test cases!)

Jim Idle jimi at temporal-wave.com
Wed Oct 24 09:12:59 PDT 2007



> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Austin Hastings
> Sent: Wednesday, October 24, 2007 4:44 AM
> To: Terence Parr
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Lexer bug? (with test cases!)
> 
> Terence Parr wrote:
> >
> > It appears it's not backtracking because only 1 extra char is enough,
> > but it cannot be statically determined if we assume any char can
> > follow.  I think you could make the subrule more complicated and it
> > would require 2 extra then 3 then 4 etc...
> 
> I think that inserting a catch{} block for optional subrules would go a
> long way toward eliminating this particular aspect of the problem. When
> you don't match correctly, throw an exception, catch it, and reset. The
> consumption of the input becomes temporary until the token completes,
> but that's straightforward.

Remember that not all the target languages have exceptions and that
exceptions are tediously slow as they are meant to be used for actual
exception conditions, not ways of programming your way out of algorithmic
paths. It seems that not too many people are aware of this, and it is
perhaps because the term exception implies that you can make your own
algorithmic exceptions I guess. 

Apart from this, as the lexer will generally be the biggest consumer of your
CPU resources, other than what you potentially can do in your actions, it is
worth thinking it through and not just letting the tool spit out a morass of
code that will seemingly do the job for you but will be throwing exceptions
all over the place and running like a dog. Most people want/need efficient
recognizers :-)

> >> >Remember folks: you can't just list a bunch of lexer rules that
> >> >make sense to a human and expect ANTLR to "make it so".
> >>
> >> Why not?  This case seems fairly straightforward.
> >
> > Not to me and I spent 4 years making LL(*) work. ;)
> 
> The original question was regarding Number = DIGIT+ ('.' DIGIT+)? and
> how LL(*) was deciding that '.' implied the entire chain, when it
> didn't. Perhaps another year would help, but I don't see how this
> counts
> as working.

Ter is saying that it is behaving as he expects it to for this case. One's
definition of working may of course differ from his ;-), but it is not
broken, it works as he expects it to. Personally, as I can trivially express
such a case in an ANTLR lexer, and the resulting code is more maintainable
(because I do not have to imagine the implied interactions), I don't worry
about it. It seems trite. While some things might be more 'naturally'
expressed in say {f}lex, many other things are horrible. Even in this we
probably all differ, because what seems natural to one does not to another -
the idea of throwing together some lexer rules and having ANTLR just work it
out has appeal but doesn't seem natural to me!

However, there is nothing to stop anyone programming their lexer in {f}lex,
or any other system for that matter, and supplying the token stream that way
of course! There may well be cases where a lex like specification is easier
for your particular situation. You can find references on the Wiki that show
this to have been one of Ter's thoughts up front, though I imagine he was
thinking more of hand crafted lexers that had to do strange things to
tokenize, rather than a different lexer generator per se.

> 
> present-day case.
> 
> Besides, isn't that what backtrack=... is for? :-) Maybe there's a
> lexer::backtrack switch needed? Or maybe k needs to be * instead of 1.

The backtrack option is for the parser, and while I can see its usefulness
for prototyping, I personally believe many that are new to ANTLR are getting
themselves into a lot of trouble with this option ;-)

> 
> 
> > Have you built a lex like creature before?  I have back in 1988.  I
> > believe I'm remembering how it operates correctly.
> >
> I don't think it matters, does it? You're in the recursive-descent
> business, not the state-machine business. You sell chicken, they sell
> hamburgers.

Yeah except that some people keep saying things akin to "Lex does it like
this...", hence Ter is establishing that he understands how lex like systems
are put together, and the way that ANLTR produces lexers is not down to some
lack of knowledge on his part about other ways of constructing lexer
generators, which I think most people would have to agree with.

Jim 



More information about the antlr-interest mailing list