[antlr-interest] (follow up) setting, altering text in lexer rules

Tue Jun 13 02:20:58 PDT 2006

On 6/13/06 2:44 AM, "Terence Parr" <parrt at cs.usfca.edu> wrote:

> 
> On Jun 12, 2006, at 5:42 PM, shmuel siegel wrote:
> 
>> Terence Parr wrote:
>> 
>>> Well the ! thing has always been a pain in the ass ;)  I'd rather
>>> opt for speed for now with a workable solution and see what
>>> happens in the future. :)
>>> Ter
>> I, for one will really miss "the ! thing" when dealing with
>> strings. It is much easier to adjust the end pointers to strings
>> than it is to create a new string as a substring. Can we at least
>> have that feature for 3.0, i.e., the ability to throw away
>> beginning and ending character sequences. Forcing the programmer to
>> manipulate strings will more than throw away any perceived time
>> savings that you think you will be achieving by avoiding this case,
>> not to mention making the programmer's job harder.
> 
> So you are saying that STRING token dominates the tokenizing time?  I
> doubt it.  Build your own token and emit that uses char indexes into
> the char buff that are one off.  No char creation at all.

I think that this is the magic bullet that people are missing. It seems to
me that the most common case is removing things like quote marks from
strings, not placing char[4] before char[9] and adding '$$$' on the end or
something, which would probably work out better in more extensive custom
code anyway. In the C version this would mean intercepting the token,
incrementing a pointer and decrementing a counter. As far as I can tell this
isn't that different in Java.

The token generally just points into the input buffer (some trickiness may
be required when reading from a socket that never closes and so on, but it
would seem logical for a stream always to have some 'endpoint', even if the
socket remained open forever).
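
As a rough illustration of the index adjustment described above, here is a
minimal Java sketch. It assumes a token that records inclusive start/stop
indexes into a shared char buffer; SliceToken and withoutDelimiters are
hypothetical names made up for this example and are not part of the ANTLR
runtime.

```java
// Hypothetical token type for illustration only; this is not ANTLR's Token API.
final class SliceToken {
    final char[] buffer; // shared input buffer, never copied
    final int start;     // index of the first character (inclusive)
    final int stop;      // index of the last character (inclusive)

    SliceToken(char[] buffer, int start, int stop) {
        this.buffer = buffer;
        this.start = start;
        this.stop = stop;
    }

    // Drop the first and last character (e.g. the quote marks) by moving
    // both indexes one position inward. No characters are created or copied.
    SliceToken withoutDelimiters() {
        if (stop - start < 1) {
            throw new IllegalStateException("token too short to have two delimiters");
        }
        return new SliceToken(buffer, start + 1, stop - 1);
    }

    int length() {
        return stop - start + 1;
    }

    // A String is only materialized if a caller actually asks for the text.
    String getText() {
        return new String(buffer, start, length());
    }
}

final class SliceTokenDemo {
    public static void main(String[] args) {
        char[] input = "x = \"hello\";".toCharArray();
        // Pretend the lexer matched a STRING token at indexes 4..10.
        SliceToken raw = new SliceToken(input, 4, 10);
        SliceToken trimmed = raw.withoutDelimiters();
        System.out.println(raw.getText());
        System.out.println(trimmed.getText());
    }
}
```

The only per-token cost here is two integer adjustments; whether that matters
in practice depends on how often the token text is actually requested.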

Personally, I would much rather implement a little bit of custom code for
the odd token, where I can spend time on it if it is important
performance-wise, than sacrifice performance for the general case. Simple is
best, and it looks to me as though Ter has followed the adage that it should
be as simple as possible, but no simpler, pretty well.

The point of ANTLR 3 is to make the generation of recognizers simple, but
not to completely rid you of any responsibility for the performance of your
lexer/parser/tree parser. The most convenient expression is not necessarily
the best.

That said, if this became the number one perceived 'problem' with ANTLR 3,
then I would be applauding, as there would then be only that one feature to
add to a 3.1 version ;-). I think that all Ter is saying here is "let me come
back to this one if people find it a real pain."

Keep on rockin',

Jim