[antlr-interest] Antlr 3 and the newline token problem
mail at martin-probst.com
Sat Nov 26 08:52:36 PST 2005
> In any case you've omitted the per-character call for col/offset tracking.
> We were discussing line/col/offset counting not just newlines.
Well, the offset gets tracked anyway: ANTLR is walking through a String,
so it has to track the input position in any case. That value is, IIRC,
also accessible (or could be made accessible very easily).
What is left is line breaks. How would you imagine ANTLR lexers doing that
more efficiently? E.g. by always checking whether the next character(s)
are \r\n, \n or \r? What about users who want \0 as their line separator?
Or users who don't want line tracking at all?
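To make the built-in alternative concrete, here is a minimal sketch of what an always-on tracker would have to do for every character. LineTracker and all of its members are made-up names for illustration, not anything in ANTLR. Note that it hard-codes one policy (\r\n, \n and \r each count as one line break), so a user who wants \0, or no tracking at all, is out of luck.

```java
// Hypothetical helper, NOT part of ANTLR: derives line/column from the
// characters fed to it, with the line-break policy hard-coded.
final class LineTracker {
    private int line = 1;              // 1-based line number
    private int column = 0;            // characters consumed on the current line
    private int offset = 0;            // absolute position in the input
    private boolean lastWasCR = false; // true if the previous char was \r

    /** Feed one character; \r\n, \n and \r each count as a single line break. */
    void consume(char c) {
        offset++;
        if (c == '\n') {
            if (!lastWasCR) {
                line++;                // bare \n starts a new line
            }                          // otherwise: second half of \r\n, already counted
            column = 0;
            lastWasCR = false;
        } else if (c == '\r') {
            line++;                    // \r starts a new line (alone or as part of \r\n)
            column = 0;
            lastWasCR = true;
        } else {
            column++;
            lastWasCR = false;
        }
    }

    int line()   { return line; }
    int column() { return column; }
    int offset() { return offset; }
}

class LineTrackerDemo {
    public static void main(String[] args) {
        LineTracker t = new LineTracker();
        for (char c : "a\r\nb\nc\rd".toCharArray()) {
            t.consume(c);
        }
        System.out.println(t.line() + ":" + t.column() + " @" + t.offset());
    }
}
```

The branch on every character is the price of doing it in the lexer core, and the fixed policy is the flexibility problem.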
> If the lexer was built to do it properly, there would be no function calls
> at all.
The overhead of a function call on x86 is very low. Plus, your compiler
might decide to inline it, at least in managed languages, as said. For C++
a template-based approach that needs no virtual methods has been discussed.
> > I don't know what you're
> > doing with the 4000 lines you have parsed in the same time,
> > but are 4000 de-refs really significant compared to stepping
> > through the parsing rules for 4000 lines of code and building the AST?
> Lexers don't build ASTs. The per-char calls needed for line/col/offset
> tracking would definitely hurt lexer performance if the counts were tacked
> on via overridden methods.
The only thing that is (currently) done using an overridden method is
the newline thing, isn't it? A per-character virtual method call would
be ugly, that's true.
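As a rough illustration of the difference, here is a toy scanner in which the overridable hook fires only when the newline rule matches, not once per character. MiniLexer, NoLineCountLexer and hookCalls are hypothetical names for this sketch. Only the idea of an overridable newline() hook is taken from the discussion; this is not ANTLR's actual lexer code.

```java
// Hypothetical miniature lexer, NOT ANTLR code: shows a virtual hook that is
// invoked once per matched line break instead of once per character.
class MiniLexer {
    protected int line = 1;
    int hookCalls = 0;   // how often the hook was invoked (for illustration only)

    /** Overridable hook; the default policy simply counts lines. */
    protected void newline() {
        line++;
    }

    /** Scans the whole input and returns the number of characters consumed. */
    int scan(String input) {
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\r') {
                // "NEWLINE rule": \r\n counts as one break, a bare \r as well
                boolean crlf = i + 1 < input.length() && input.charAt(i + 1) == '\n';
                i += crlf ? 2 : 1;
                hookCalls++;
                newline();
            } else if (c == '\n') {
                i++;
                hookCalls++;
                newline();
            } else {
                i++;     // ordinary character: no virtual call at all
            }
        }
        return i;
    }

    int line() { return line; }
}

// A user who does not want line tracking overrides the hook with a no-op.
class NoLineCountLexer extends MiniLexer {
    @Override
    protected void newline() {
        // intentionally empty
    }
}

class MiniLexerDemo {
    public static void main(String[] args) {
        MiniLexer lexer = new MiniLexer();
        int chars = lexer.scan("ab\ncd\r\nef\rgh");
        System.out.println(chars + " chars, " + lexer.hookCalls
                + " hook calls, now on line " + lexer.line());
    }
}
```

With lines of around 20 characters, the hook runs roughly once per 20 characters scanned, and users who want a different policy just override it.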
Are you using the lexer standalone? Even in that case I'd wonder whether
it really makes a difference. For each character you have at least one
switch, plus the testing of alternatives etc. Will one virtual method
call every ~20 characters make a difference of more than maybe 1%? I
think there are more important places where ANTLR could be, and is being,
improved, e.g. the String copying thing or various things in the C++
part that have been discussed countless times on this list.
I'm not generally arguing against including something like that, but
you'd have to find a very flexible way to do so. Otherwise users will be
unhappy because it doesn't match what they want to have, and their
solution might get more complex.