[antlr-interest] Re: feature request: Token.getOffset()

Sun Dec 7 03:51:58 PST 2003

--- In antlr-interest at yahoogroups.com, "cj_daly" <cj_daly at y...> wrote:
> 
> I see how this could work by overriding tab() and newline(), but I was
> thinking tracking absolute offset could be generally useful - so why

I'm not convinced--editors tend to present a line/column view of 
text, which is why it makes sense to support keeping that 
information for mapping errors to input.  There aren't many tools 
routinely used that do the same with absolute offsets.

Absolute offset does not have the same utility, and it is a good 
idea for the lexer to track minimal state information--the tracking 
affects lexer performance.  ANTLR 2 lexers are slow enough as it is 
(ANTLR 3 should fix that--Ter's working on a few optimizations).

> not put it into the codebase.  And I doubt that adding an int and 
> incrementing it in a couple of places (where column is currently 
> changed in CharScanner) is going to affect performance or 
> maintainability.
> 
> Here's another angle: isn't offset a more fundamental measure than 
> line/column to begin with?  I mean your input source could be bits,
> bytes, chars, nodes or whatever and line/column may not have any
> meaning in some of those cases, but offset is your way of tracking
> a token back to its place in the input source.
> 
> just my 2 bits...

It may not seem like much, but it is a counter incremented for each 
character, and the optimum is to test a character and move to the 
next unless the end of a token has been reached.  There could be a 
20% performance hit from maintaining another character counter--
better not to add that as a matter of course and reconstruct it from 
the line/column support when needed.
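
For illustration, that reconstruction could be sketched roughly as 
below.  This is a standalone Java model, not ANTLR code: OffsetTracker, 
its fields, and matchesIndex() are hypothetical names, and the tab 
arithmetic assumes 1-based columns with a tab stop every tabSize 
columns.  In a real lexer the same bookkeeping would presumably live in 
overrides of tab() and newline(), with no extra work on ordinary 
characters.

```java
/** Standalone model of the idea; a sketch, not part of ANTLR. */
final class OffsetTracker {
    private final int tabSize;
    private int column = 1;          // 1-based, tab-expanded column
    private int lineStartOffset = 0; // absolute offset of the current line's first char
    private int tabCorrection = 0;   // extra columns contributed by tabs on this line

    OffsetTracker(int tabSize) {
        this.tabSize = tabSize;
    }

    /** The per-character step a lexer already does: only the column moves. */
    void consume(char c) {
        if (c == '\t') {
            tab();
        } else {
            column++;
        }
    }

    /** Jump to the next tab stop and remember how many columns beyond one were added. */
    void tab() {
        int next = ((column - 1) / tabSize + 1) * tabSize + 1;
        tabCorrection += (next - column) - 1;
        column = next;
    }

    /** Call after the newline character(s) have been consumed. */
    void newline() {
        // Characters consumed on the finished line, including the newline itself.
        lineStartOffset += (column - 1) - tabCorrection;
        tabCorrection = 0;
        column = 1;
    }

    /** 0-based absolute offset of the next character to be read. */
    int offset() {
        return lineStartOffset + (column - 1) - tabCorrection;
    }

    /** Checks that offset() agrees with the string index at every position. */
    static boolean matchesIndex(String text, int tabSize) {
        OffsetTracker t = new OffsetTracker(tabSize);
        for (int i = 0; i < text.length(); i++) {
            if (t.offset() != i) {
                return false;
            }
            char c = text.charAt(i);
            t.consume(c);
            if (c == '\n') {
                t.newline();
            }
        }
        return t.offset() == text.length();
    }
}
```

To get a token's offset this way, the correction in effect at the 
token's first character has to be paired with the token's column; for 
tokens that contain no tabs it is the same value at the start and end 
of the token.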

--Loring

> --- In antlr-interest at yahoogroups.com, "lgcraymer" <lgc at m...> wrote:
> > How about: override tab() to keep a correction value for column 
> > information, and override newline() to track offset for the start of
> > the current line.  Then you can compute the character offset 
> > yourself: (line start offset + column - correction) should work 
> > using the token's column information since the correction only 
> > changes at tabs.
> > 
> > Adding more state to the lexer is something that is better avoided.
> > 
> > --Loring
> > 
> > 
> > 
> > --- In antlr-interest at yahoogroups.com, "cj_daly" <cj_daly at y...> wrote:
> > > Hi Antlr Maintainers,
> > > 
> > > For my purposes currently it's more important to have the absolute
> > > offset into the input file for each token than to have the
> > > line/column.  To get what I want I've been calling
> > > 
> > > lexer.setColumn(0);
> > > lexer.setTabSize(1);
> > > 
> > > before the parse and then calls to getColumn() return the offset I
> > > need.  But this means I never call newline() because that would reset
> > > the column counter and thus I can't have line/column info if I want
> > > it.
> > > 
> > > I think that it would be nice and easy to have it both ways.  We would
> > > just need to add getOffset() and setOffset() to Token and then have
> > > LexerSharedInputState manage an offset counter independently of the
> > > line/column counters.
> > > 
> > > Does that make sense?  Am I totally missing something here (i.e. is
> > > the offset info I need already available somewhere)?
> > > 
> > > 
> > > Chris
