[antlr-interest] Revisited: Ending line/column of a token

Sun Nov 4 18:47:03 PST 2012

Hiya, folks—

About a year ago, someone asked how to recover the final (as well as
the initial) line and column of a token, e.g. to make them available
to an AST or an action, and pointed out that it's undesirable to have
to resort to counting characters and newlines:

http://www.antlr.org/pipermail/antlr-interest/2011-October/043116.html

It appears as if nobody ever answered.

The workaround I'm currently pursuing is promising but still somewhat
heinous: Any lexer rule that might contain newlines (multi-line
strings, comments, etc.) doesn't complete normally. Instead, it emits
two tokens: one containing the text proper, followed by an artificial,
zero-width "EndFinder" token.

The mechanism is based on the multi-emit lexer described at
http://www.antlr.org/wiki/pages/viewpage.action?pageId=3604497
A zero-width fragment rule is forced into existence as an actual token
type in a manner similar to this:

	fragment EndFinder : ; // zero-width

	fragment SomeMultiLineThingText :
		'<<'
		( options {greedy=false;} : . )*
		'>>'
		;

	SomeMultiLineThing : t=SomeMultiLineThingText z=EndFinder
		{
			// The setType()s are necessary for
			// fragment-originated tokens, so
			// I gather
			$t.setType(SomeMultiLineThing);
			emit($t);
			$z.setType(EndFinder);
			emit($z);
		}
		;
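
For anyone who hasn't seen the multi-emit trick, here is a minimal
sketch of the queueing idea in plain Java, with no ANTLR runtime
involved. It is not the wiki's code; Tok, MultiEmitSketch and
matchBlock are names made up for this illustration, and the hand-rolled
scanner only stands in for what the generated lexer does. The point is
that emit() queues tokens instead of keeping just one, and that the
zero-width token picks up whatever line/column the lexer has reached.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for an ANTLR token: type, text, starting line/column.
final class Tok {
    final String type, text;
    final int line, pos;
    Tok(String type, String text, int line, int pos) {
        this.type = type; this.text = text; this.line = line; this.pos = pos;
    }
}

// Hypothetical mini-lexer that only knows '<<' ... '>>' blocks.
class MultiEmitSketch {
    private final String input;
    private int index = 0, line = 1, pos = 0;      // the lexer's own position tracking
    private final Deque<Tok> pending = new ArrayDeque<>();

    MultiEmitSketch(String input) { this.input = input; }

    // Queue the token rather than overwriting a single "current token" slot.
    void emit(Tok t) { pending.addLast(t); }

    // Consume one character, updating line and column as a lexer normally does.
    private void consume() {
        if (input.charAt(index++) == '\n') { line++; pos = 0; } else { pos++; }
    }

    // Match one block and emit two tokens for it.
    private void matchBlock() {
        int start = index, startLine = line, startPos = pos;
        consume(); consume();                      // opening '<<'
        while (index < input.length() && !input.startsWith(">>", index)) consume();
        if (index >= input.length()) throw new IllegalStateException("unterminated '<<'");
        consume(); consume();                      // closing '>>'
        emit(new Tok("SomeMultiLineThing", input.substring(start, index), startLine, startPos));
        emit(new Tok("EndFinder", "", line, pos)); // zero-width, positioned at the end
    }

    // Drain queued tokens first; only scan for a new block when the queue is empty.
    Tok nextToken() {
        if (pending.isEmpty()) {
            while (index < input.length() && !input.startsWith("<<", index)) consume();
            if (index >= input.length()) return new Tok("EOF", "", line, pos);
            matchBlock();
        }
        return pending.removeFirst();
    }
}
```

Feeding it "ab <<x\ny>> c" and calling nextToken() twice should yield
the block token starting at line 1, column 3, and then the EndFinder at
line 2, column 3, i.e. just past the closing '>>'.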

And it actually seems to work, which surprised me. The unfortunate side
effect is that each parser rule using one of these has to match two
tokens where it would be more obvious to match just one:

	someMultiLineThing : t=SomeMultiLineThing z=EndFinder
		{
			// ...
			// $t.line and $t.pos are at the start
			// $z.line and $z.pos are at the end
			// ...
		}
		;
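
To make the "// ..." part concrete: one thing the action could do is
fold the two positions into a single object so that the rest of the
code only ever sees one span. The Span class below is purely
hypothetical (nothing like it ships with ANTLR as far as I know), and
the constructor call shown in the comment assumes the usual $t.line and
$t.pos token attributes.

```java
// Hypothetical value object holding where a token starts and ends.
final class Span {
    final int startLine, startPos, endLine, endPos;

    Span(int startLine, int startPos, int endLine, int endPos) {
        this.startLine = startLine;
        this.startPos = startPos;
        this.endLine = endLine;
        this.endPos = endPos;
    }

    // True when the token crosses at least one newline.
    boolean isMultiLine() {
        return endLine > startLine;
    }

    // Renders as "startLine:startPos-endLine:endPos".
    @Override
    public String toString() {
        return startLine + ":" + startPos + "-" + endLine + ":" + endPos;
    }
}

// In the parser action this might be used as (assumed, untested):
//   Span s = new Span($t.line, $t.pos, $z.line, $z.pos);
```

This doesn't remove the need to match two tokens in the parser rule; it
only keeps the pairing from leaking any further than that rule.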

So my question now is this: Is there a "right", or at least less
brain-damaged, way to accomplish the same thing?

Thanks, and enjoy
PSM