[antlr-interest] feature request: Token.getOffset()

Mon Dec 8 06:05:17 PST 2003

Hi,

On Sun, Dec 07, 2003 at 12:05:44AM -0000, cj_daly wrote:
> lexer.setColumn(0);
> lexer.setTabSize(1);
>
> before the parse and then calls to getColumn() return the offset I
> need.  But this means I never call newline() because that would reset
> the column counter and thus I can't have line/column info if I want
> it.

You could override newline in your lexer if you don't care about the column
offset? Then again it's a bit on the hackish side.. Loring's solution is a
bit nicer with respect to that...

> I think that it would be nice and easy have it both ways.  We would
> just need to add getOffset() and setOffset() to Token and then have
> LexerSharedInputState manage an offset counter independently of the
> line/column counters.

Sometimes it could be handy... although you'd maybe also want to track the
filename then in case you'd use some include file mechanism.. and ... and ..
Though I wonder how hard it would be implementing it in the current
framework.

> Does that make sense?  Am I totally missing something here (i.e. is
> the offset info I need already available somewhere)?

You can probably make a cleaner solution by making a special inputbuffer
which tracks the offset look at CharBuffer (at least that's what it's
called in C++) keep in mind that due to guessing you (probably) cannot just
proxy the stream offset (or java equivalent of that) this might take some
tweaking to InputBuffer (not 100% sure, if so patches adding a generic way
to extend functionality there would be welcome)

After you get a custom InputBuffer going, you can use a custom token class
and factory. Set those in the lexer, you also need to override makeToken to
create tokens with offsets. A little more work is now needed to set up the
parser/lexer (creating the inputstates).

Hmmm come to think of it.. you might already get it to work by directly
correcting the stream offset by the number of entries in the InputBuffer
queue in makeToken (combined with a custom token and its factory) unless
I'm overlooking something within the rewind/commit stuff.

Generally though antlr is used in context of line/column info's, the
absolute offsets could be nice for some types of error reporting (quoting
bits of the surroundings of the error) Then again how many people use that
(or even the recently added column stuff). If the above solution(s) work(s)
then I don't see the use of adding it per default. An example or a
modification of an existing example with this trick would be welcome though
(if it works :) ).

(/grants loring permission to throw something at him) Hmmm so we can get up
to a 20% gain in our lexers by kicking out the column stuff ;) ? Or move
the checking of tabs out of the CharScanner::consume method and handle them
similarly to newline by explicit calls in the lexer?

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722  ----
-----+++++*****************************************************+++++++++-------
   Words fly like arrows
      as if we knew what was right and wrong. --- Chuang Tsu

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/