[antlr-interest] Revisited: Ending line/column of a token

Sun Nov 4 19:32:40 PST 2012

I keep an array (int[]) of line start offsets - offsets[i] is the index in the input stream of the first character on line i. A binary search then recovers the line/column information for any character index in O(log n) time, where n is the number of lines; looking up a token's stop index gives its ending line and column. If you also remove the line/column numbers stored in the Token implementation, every token shrinks in size too.
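A minimal sketch of that lookup, for illustration only. The class name LineIndex and its methods are hypothetical and not part of ANTLR. It assumes lines end with '\n' only, uses 0-based lines and columns (add 1 to the line to match ANTLR's 1-based line numbers), and assumes the token exposes its start and stop character indexes, as CommonToken does through getStartIndex() and getStopIndex().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of ANTLR): maps character indexes to line/column. */
final class LineIndex {
    // offsets[i] is the index in the input of the first character on line i (0-based).
    private final int[] offsets;

    LineIndex(String input) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0); // line 0 always starts at index 0
        for (int i = 0; i < input.length(); i++) {
            // Assumption: '\n' is the only line terminator.
            if (input.charAt(i) == '\n') {
                starts.add(i + 1);
            }
        }
        offsets = new int[starts.size()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = starts.get(i);
        }
    }

    /** Returns the 0-based line containing the character at the given index. */
    int lineOf(int index) {
        int pos = Arrays.binarySearch(offsets, index);
        // Exact hit: index is a line start. Otherwise binarySearch returns
        // -(insertionPoint) - 1, and the containing line is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the 0-based column of the character at the given index. */
    int columnOf(int index) {
        return index - offsets[lineOf(index)];
    }

    public static void main(String[] args) {
        // The token <<x\ny>> spans character indexes 3..9 (inclusive) of this input.
        LineIndex idx = new LineIndex("ab\n<<x\ny>>\nz");
        int start = 3, stop = 9;
        System.out.println("start: " + idx.lineOf(start) + ":" + idx.columnOf(start));
        System.out.println("end:   " + idx.lineOf(stop) + ":" + idx.columnOf(stop));
    }
}
```

With a table like this, the ending position of a multi-line token comes from its stop index alone, so no extra marker token is needed in the token stream.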

--
Sam Harwell
Owner, Lead Developer
http://tunnelvisionlabs.com

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Peter S. May
Sent: Sunday, November 04, 2012 8:47 PM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Revisited: Ending line/column of a token

Hiya, folks—

About a year ago, someone asked how to recover the final (as well as the initial) line and column of a token, e.g. to make them available to an AST or an action, without having to resort to counting characters and newlines:

http://www.antlr.org/pipermail/antlr-interest/2011-October/043116.html

It appears as if nobody ever answered.

The workaround I'm currently pursuing is promising but still somewhat heinous: any lexer rule that might contain newlines (multi-line strings, comments, etc.), instead of completing normally, emits a token containing the text proper, followed by an artificial, zero-width "EndFinder" token.

The basic mechanism is based on the multi-emit lexer described at http://www.antlr.org/wiki/pages/viewpage.action?pageId=3604497. A zero-width fragment rule is forced into existence as an actual type in a manner similar to this:

	fragment EndFinder : ; // zero-width

	fragment SomeMultiLineThingText :
		'<<'
		( options {greedy=false;} : . )*
		'>>'
		;

	SomeMultiLineThing : t=SomeMultiLineThingText z=EndFinder
		{
			// The setType()s are necessary for
			// fragment-originated tokens, so
			// I gather
			$t.setType(SomeMultiLineThing);
			emit($t);
			$z.setType(EndFinder);
			emit($z);
		}
		;

And it actually seems to work, which surprised me. The unfortunate side effect is that each parser rule using one of these has to match two tokens where it would be more obvious to match just one:

	someMultiLineThing : t=SomeMultiLineThing z=EndFinder
		{
			// ...
			// $t.line and $t.pos are at the start
			// $z.line and $z.pos are at the end
			// ...
		}
		;

So my question now is this: Is there a "right", or at least less brain-damaged, way to accomplish the same thing?

Thanks, and enjoy
PSM
