[antlr-interest] String lexing and partial tokens

Mon Nov 27 17:48:15 PST 2006

Jim--

The basic implementation is to have a "copy" stream
and to only copy characters there when necessary; that
is, when characters are deleted from the middle of a
lexed character sequence.  If you trim characters from
the ends of the sequence only, then there is no
copying.  There is then a four-state machine that
either (1) skips leading characters, (2) increments
the end counter, (3) copies text, or (4) skips
(possibly trailing) characters.
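
The four states described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Yggdrasil implementation: the names `State`, `TokenText`, `consume`, and `lex` are hypothetical, and the per-character `keep` flag stands in for whether a rule marked that character with `!`.

```python
from enum import Enum, auto


class State(Enum):
    SKIP_LEADING = auto()  # dropping characters before any kept text
    EXTEND = auto()        # kept text is contiguous: just move the end index
    COPY = auto()          # a middle deletion happened: append to the copy
    SKIP = auto()          # dropping characters after some kept text


class TokenText:
    """Tracks the text of one token as characters are kept or dropped."""

    def __init__(self, source, start):
        self.source = source
        self.pos = start     # next character to consume
        self.start = start   # index of the first kept character
        self.end = start     # one past the last kept character
        self.copy = None     # created only when a middle deletion occurs
        self.state = State.SKIP_LEADING

    def consume(self, keep):
        ch = self.source[self.pos]
        self.pos += 1
        if self.state is State.SKIP_LEADING:
            if keep:
                self.start, self.end = self.pos - 1, self.pos
                self.state = State.EXTEND
        elif self.state is State.EXTEND:
            if keep:
                self.end = self.pos
            else:
                self.state = State.SKIP
        elif self.state is State.COPY:
            if keep:
                self.copy.append(ch)
            else:
                self.state = State.SKIP
        else:  # State.SKIP
            if keep:
                # Kept text after a dropped run: the deletion was in the
                # middle, so fall back to the copy buffer from here on.
                if self.copy is None:
                    self.copy = list(self.source[self.start:self.end])
                self.copy.append(ch)
                self.state = State.COPY

    def text(self):
        if self.copy is not None:
            return "".join(self.copy)
        return self.source[self.start:self.end]


def lex(source, keep_flags):
    """Feed one keep/drop flag per character of source, starting at 0."""
    token = TokenText(source, 0)
    for keep in keep_flags:
        token.consume(keep)
    return token
```

In this sketch, trimming only the quotes from `"abc"` never allocates the copy buffer and the text is just a slice of the source, while dropping one of the doubled quotes inside `'it''s'` forces the copy path.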

The reason for the current performance hit is that the
logic for character handling is repeated everywhere
instead of just in the rules that edit.  The 5-10%
figure does not include copying, which would add
further overhead--but I have never run into a grammar
that needed copying, although I can see the value.
The next revision will be to make only rules that edit
suffer the performance hit.  I will note that the
baseline ANTLR 3 approach of "construct your own
string" has much higher overheads for character
editing.

--Loring

--- Jim Idle <jimi at intersystems.com> wrote:

> Loring,
> 
> To my mind, 5-10% (assuming that you mean runtime)
> is still quite an overhead for such a feature when
> there are other ways of achieving the same thing that
> don't really cost anything (at least in the C
> runtime, anyway).
> 
> I would be interested in how you measure this,
> because anything that causes the token string to be
> created rather than just be indexes into the source
> would surely have a higher overhead than this. I
> suppose that when the token spec applies ! to a
> fixed-length leading or trailing part of the token,
> the start and end indexes could be adjusted at token
> emit time, but it just doesn't seem such a big deal
> to me, so long as there are reasonable ways of
> achieving the same thing manually. It seems that the
> removal of " in strings is in fact the main use for
> this functionality.
> 
> However, I don't believe that Ter rejected looking
> at this out of hand, just that for the moment there
> are plenty of other things to work on. That said,
> for my part, I think it is just a matter of
> documenting some ways to achieve the same thing and
> people getting used to them. I don't think that
> people object to changing ways of doing things if
> they are reasonable. While it is obviously quite a
> lot easier to just add ! to the matching text, you
> do this work once, whereas the resulting lexer will
> presumably run many more times than once; it seems
> that it is worth the small effort at grammar
> specification time to keep the lexer as trim as
> possible.
> 
> I am a fan of the ANTLR 3 approach of simplification
> over ANTLR 2, which generally yields leaner code
> generation, and transferring a certain amount of the
> effort to the grammar author. There are limits to
> this of course, but I think ANTLR 3 is a reasonable
> blend, given that it makes grammar programming in
> general so much easier than its predecessors. 
> 
> However I am sure that your efforts in this regard
> will be appreciated if they turn out to yield
> something that has very little overhead and little
> time to incorporate into the main ANTLR product.
> 
> Jim
> 
> 
> 
> -----Original Message-----
> From: Loring Craymer [mailto:lgcraymer at yahoo.com] 
> Sent: Monday, November 27, 2006 4:27 PM
> To: Jim Idle; antlr-interest at antlr.org
> Subject: Re: [antlr-interest] String lexing and partial tokens
> 
> 
> --- Jim Idle <jimi at intersystems.com> wrote:
> 
> ..
> > You can ask Jim Idle about that, but we decided to
> > use methods for setting the text rather than
> > implementing ! which makes everything inefficient.
> > I could swear there was something in the
> > documentation.
> 
> ! in the lexer does not "make everything inefficient";
> you just have to be smart about the implementation.
> The lexer editing via ! that is currently in the
> Yggdrasil 0.5b releases (I'll have b2 out soon) costs
> about 5-10% (rough estimate from looking at generated
> code); once I can analyze which rules edit, that drops
> still further.
> 
> --Loring
