[antlr-interest] Revisited: Ending line/column of a token

Mon Nov 5 04:46:43 PST 2012

The offsets table is an interesting idea. Where (in what method) do you
record the offsets?

(In the case of my project, the token shrinking would be premature
optimization. But the table part is intriguing.)
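
For concreteness, here is a minimal sketch of the table as I read the
description quoted below. Everything in it is an assumption on my part:
the class name LineOffsets and its methods are hypothetical (nothing here
is ANTLR API), only '\n' is treated as a line terminator, and the table is
built in one pass over the whole input rather than recorded during lexing.
Lines are reported 1-based and columns 0-based, which is how ANTLR 3
tokens report them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of ANTLR): maps a character index to line/column. */
final class LineOffsets {
    // starts[i] is the input index of the first character on the i-th line (0-based i).
    private final int[] starts;

    /** Builds the table with a single scan; assumes '\n' ends every line. */
    LineOffsets(CharSequence input) {
        List<Integer> found = new ArrayList<>();
        found.add(0); // the first line always starts at index 0
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') {
                found.add(i + 1); // the next line starts right after the newline
            }
        }
        starts = new int[found.size()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = found.get(i);
        }
    }

    /** 1-based line containing the character at index (index must be >= 0). */
    int line(int index) {
        int pos = Arrays.binarySearch(starts, index);
        // Exact hit: index is itself a line start. Otherwise binarySearch
        // returns -(insertionPoint) - 1, and the containing line is the
        // entry just before the insertion point.
        int zeroBased = pos >= 0 ? pos : -pos - 2;
        return zeroBased + 1;
    }

    /** 0-based column of the character at index within its line. */
    int column(int index) {
        return index - starts[line(index) - 1];
    }
}
```

If a token keeps its start and stop indexes into the input (as ANTLR 3's
CommonToken does via getStartIndex() and getStopIndex()), then line(stop)
and column(stop) would give the position of its last character without
any extra token being emitted.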

On Mon, 5 Nov 2012 03:32:40 +0000
Sam Harwell <sam at tunnelvisionlabs.com> wrote:

> I keep an array (int[]) of line start offsets - offsets[i] is the
> index in the input stream of the first character on line i. A binary
> search allows the lookup of line/column information for any index in
> O(log n) time. If you also remove the line/column numbers stored in
> the Token implementation, every token shrinks in size too.
> 
> --
> Sam Harwell
> Owner, Lead Developer
> http://tunnelvisionlabs.com
> 
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Peter S. May
> Sent: Sunday, November 04, 2012 8:47 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Revisited: Ending line/column of a token
> 
> Hiya, folks—
> 
> About a year ago a question was brought up about recovering the
> final (as well as the initial) line and column of a token, e.g. to
> make them available to an AST or an action, without having to resort
> to counting characters and newlines:
> 
> http://www.antlr.org/pipermail/antlr-interest/2011-October/043116.html
> 
> It appears as if nobody ever answered.
> 
> The workaround I'm currently pursuing is promising but still somewhat
> heinous: any lexer rule that might contain newlines (multi-line
> strings, comments, etc.), instead of completing normally, emits a
> token containing the text proper followed by an artificial,
> zero-width "EndFinder" token.
> 
> The basic mechanism is based on the multi-emit lexer described at
> http://www.antlr.org/wiki/pages/viewpage.action?pageId=3604497. A
> zero-width fragment rule is forced into existence as an actual type
> in a manner similar to this:
> 
> 	fragment EndFinder : ; // zero-width
> 	
> 	fragment SomeMultiLineThingText :
> 		'<<'
> 		( options {greedy=false;} : . )*
> 		'>>'
> 		;
> 	
> 	SomeMultiLineThing : t=SomeMultiLineThingText z=EndFinder
> 		{
> 			// The setType()s are necessary for
> 			// fragment-originated tokens, so
> 			// I gather
> 			$t.setType(SomeMultiLineThing);
> 			emit($t);
> 			$z.setType(EndFinder);
> 			emit($z);
> 		}
> 		;
> 
> And it actually seems to work, which surprised me. The unfortunate
> side effect is that each parser rule using one of these has to match
> two tokens where it would be more obvious to match just one:
> 
> 	someMultiLineThing : t=SomeMultiLineThing z=EndFinder
> 		{
> 			// ...
> 			// $t.line and $t.pos are at the start
> 			// $z.line and $z.pos are at the end
> 			// ...
> 		}
> 		;
> 
> So my question now is this: Is there a "right", or at least less
> brain-damaged, way to accomplish the same thing?
> 
> Thanks, and enjoy
> PSM
> 