[antlr-interest] Lexer bug? (with test cases!)

Wed Oct 24 10:57:45 PDT 2007

Jim Idle wrote:
>> -----Original Message-----
>> From: Austin Hastings
>>
>> Terence Parr wrote:
>>     
>>> It appears it's not backtracking because only 1 extra char is enough,
>>> but it cannot be statically determined if we assume any char can
>>> follow.  I think you could make the subrule more complicated and it
>>> would require 2 extra then 3 then 4 etc...
>>>       
>> I think that inserting a catch{} block for optional subrules would go a
>> long way toward eliminating this particular aspect of the problem. When
>> you don't match correctly, throw an exception, catch it, and reset. The
>> consumption of the input becomes temporary until the token completes,
>> but that's straightforward.
>>     
>
> Remember that not all the target languages have exceptions and that
> exceptions are tediously slow as they are meant to be used for actual
> exception conditions, not ways of programming your way out of algorithmic
> paths. It seems that not too many people are aware of this, and it is
> perhaps because the term exception implies that you can make your own
> algorithmic exceptions I guess. 
>   
Actually, I was ignoring most of that for the simple reason that Antlr 
*currently* throws an exception (in Java) when a token fails to appear 
as it wants. My suggestion was that for possibly-zero-length 
sub-elements, a try/catch would be a simple patch to the existing 
template(s) that would address the issue. In other words, "I made a 
LL(1) guess about this, but I was wrong. Eat the exception. Nothing to 
see here, folks. Move along."

> Apart from this, as the lexer will generally be the biggest consumer of your
> CPU resources, other than what you potentially can do in your actions, it is
> worth thinking it through and not just letting the tool spit out a morass of
> code that will seemingly do the job for you but will be throwing exceptions
> all over the place and running like a dog. Most people want/need efficient
> recognizers :-)
>   
I question this statement, since there are so few ops/character being 
performed. But any discussion of lexer performance is a red herring 
anyway, since the generated code already uses exceptions to implement 
failure/recovery. The issue is improving the recovery.

>>>>> Remember folks: you can't just list a bunch of lexer rules that
>>>>> make sense to a human and expect ANTLR to "make it so".
>>>>>           
>>>> Why not?  This case seems fairly straightforward.
>>>>         
>>> Not to me and I spent 4 years making LL(*) work. ;)
>>>       
>> The original question was regarding Number = DIGIT+ ('.' DIGIT+)? and
>> how LL(*) was deciding that '.' implied the entire chain, when it
>> didn't. Perhaps another year would help, but I don't see how this
>> counts
>> as working.
>>     
>
> Ter is saying that it is behaving as he expects it to for this case. One's
> definition of working may of course differ from his ;-), but it is not
> broken, it works as he expects it to. Personally, as I can trivially express
> such a case in an ANTLR lexer, and the resulting code is more maintainable
> (because I do not have to imagine the implied interactions), I don't worry
> about it. It seems trite. While some things might be more 'naturally'
> expressed in say {f}lex, many other things are horrible. Even in this we
> probably all differ, because what seems natural to one does not to another -
> the idea of throwing together some lexer rules and having ANTLR just work it
> out has appeal but doesn't seem natural to me!
>   
I must have missed that email, then. I saw this one 
(http://www.antlr.org/pipermail/antlr-interest/2007-October/024345.html) 
but of the two NUMBERS:

NUMBER
	: ('0'..'9')+
		(
			 '.'
			 	(
			 		 '.' ('0'..'9')+ { $type = RANGE; }
			 		| ('0'..'9')+
			 		| // ERROR
			 	)
			| // Just an integer
		)
	;

NUMBER
	: ('0'..'9')+
		(
			 ('.' '0'..'9')=> ('.' ('0'..'9')+)
			|// Just an integer
		)
	;

Which one is it that is "trivial," the one with token type overriding, 
or the one with syntactic predicates?

One of the reasons I like Antlr3 is the relatively clean syntax. It's 
very approachable, and for the most part quite clear. It certainly seems 
to allow me to "write what I need," and that means writing regular 
expressions that describe what I'm trying to recognize. The original 
question involved what I thought was a "trivial" regex for a 
possibly-fractional number:

DIGIT+ ('.' DIGIT+)?

except it was wrapped up in fragments a little bit. Of course, that 
isn't quite as exact as:

Number	: '-'? Digit+ ( '.' Digit+)?; //(from: http://www.antlr.org/wiki/display/ANTLR3/JSON+Interpreter)

> However, there is nothing to stop anyone programming their lexer in {f}lex,
> or any other system for that matter, and supplying the token stream that way
> of course! There may well be cases where a lex like specification is easier
> for your particular situation. You can find references on the Wiki that show
> this to have been one of Ter's thoughts up front, though I imagine he was
> thinking more of hand crafted lexers that had to do strange things to
> tokenize, rather than a different lexer generator per se.
>   
"If you don't like it, then leave" is valid, but I think most of us here 
like it, at least enough to talk about how to make it better, and to 
maybe try and understand why it doesn't work like we think it should.

I notice that the Antlr book featured about 2 pages that talked about 
lexing, and none at all that were dedicated to lexer issues. This is 
despite the fact that the same(?) behavior was present in Antlr2 -- at 
least, I notice Antlr2 grammars like the f77 grammar that have the same 
kind of lexer-predicate stuff to disambiguate tokens that I cited from 
you above. That's probably because Mr. Parr, like just about everyone 
else, doesn't find lexing interesting or exciting. And rightfully so - 
lexing is boring compared to parsing. But that's only true if the lexer 
gets out of the way. Like a light switch, it's supposed to "just work." 
When it doesn't, it's bewildering.

>> present-day case.
>>
>> Besides, isn't that what backtrack=... is for? :-) Maybe there's a
>> lexer::backtrack switch needed? Or maybe k needs to be * instead of 1.
>>     
>
> The backtrack option is for the parser, and while I can see its usefulness
> for prototyping, I personally believe many that are new to ANTLR are getting
> themselves into a lot of trouble with this option ;-)
>   
You've said that a few times lately, and I, for one, wish that you would 
go ahead and write the "how to get out of backtracking" tutorial that 
seems called for. I've got backtracking turned on because I inherited 
(stole) my grammar. Because I fear and mistrust Antlr, I am afraid to 
turn it off because I don't have a good set of test cases built yet. 
(Too much time reading antlr-interest, probably. ;-)

>> I don't think it matters, does it? You're in the recursive-descent
>> business, not the state-machine business. You sell chicken, they sell
>> hamburgers.
>>     
>
> Yeah except that some people keep saying things akin to "Lex does it like
> this...", hence Ter is establishing that he understands how lex like systems
> are put together, and the way that ANLTR produces lexers is not down to some
> lack of knowledge on his part about other ways of constructing lexer
> generators, which I think most people would have to agree with.
>   

I prefer think of the situation above as an oversight rather than 
deliberate sabotage, so I'm not going there.

=Austin