[antlr-interest] antlr v4 wish list

Sam Harwell sharwell at pixelminegames.com
Wed Mar 30 07:21:27 PDT 2011


Hi Matt,

I agree except that it should hold the length of the token in the input
stream instead of the end (where end=start+length-1). This allows insertion
of zero-length tokens such as the implied ; tokens in Go (language by
Google) without storing confusing positions like (start=3, end=2) - instead
you get (start=3, length=0). Length and End are technically equivalent, but
length leaves no questions about endpoints (inclusive/exclusive, and what to
do with zero-length tokens). In my first comment, I was simply stating that
you only need to store 2 values (start/length) instead of 4
(start/length/line/column), because the latter two can be efficiently
derived from the former.
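
To make the start/length idea concrete, here is a minimal sketch. The class
and method names (Span, endInclusive, endExclusive, isEmpty) are hypothetical
and are not part of any ANTLR API; the sketch only shows that both kinds of
endpoint can be derived on demand and that a zero-length token needs no
special case.

```java
// Hypothetical sketch; these names are illustrative, not ANTLR's API.
final class Span {
    final int start;   // offset of the first character in the input stream
    final int length;  // number of characters; 0 for an implied token

    Span(int start, int length) {
        if (start < 0 || length < 0) {
            throw new IllegalArgumentException("start and length must be non-negative");
        }
        this.start = start;
        this.length = length;
    }

    // Exclusive end offset: one past the last character.
    int endExclusive() {
        return start + length;
    }

    // Inclusive end offset (end = start + length - 1).
    // For a zero-length span this is start - 1, the confusing case.
    int endInclusive() {
        return start + length - 1;
    }

    boolean isEmpty() {
        return length == 0;
    }
}
```

With this, an implied ; at offset 3 is simply new Span(3, 0), and the odd
(start=3, end=2) pair only appears if a caller explicitly asks for the
inclusive end.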

In ANTLR v4 the line map I mentioned will likely be included in the base
lexer implementation, so it will be efficient (space and time) without
becoming a burden.
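
As a rough illustration of such a line map (not the actual v4 design), the
sketch below records the start offset of every line once and then derives
line and column from a character offset with a binary search. The class name
LineMap and its methods are made up for this example, and it assumes lines
end with '\n' only and that lines and columns are 0-based.

```java
import java.util.Arrays;

// Hypothetical sketch of a line-offset table; not ANTLR's implementation.
final class LineMap {
    // lineStarts[i] is the offset of the first character of line i (0-based).
    private final int[] lineStarts;

    LineMap(CharSequence input) {
        int[] starts = new int[16];
        int count = 0;
        starts[count++] = 0; // line 0 always starts at offset 0
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') { // assumption: '\n' is the only terminator
                if (count == starts.length) {
                    starts = Arrays.copyOf(starts, count * 2);
                }
                starts[count++] = i + 1;
            }
        }
        lineStarts = Arrays.copyOf(starts, count);
    }

    // Returns the 0-based line containing the given non-negative offset.
    int lineOf(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        // Not found: idx == -(insertionPoint) - 1, and the line is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    // Returns the 0-based column of the given offset within its line.
    int columnOf(int offset) {
        return offset - lineStarts[lineOf(offset)];
    }
}
```

The table costs one int per line rather than two per token, and each lookup
is O(log n) in the number of lines.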

Sam

-----Original Message-----
From: Matt Fowles [mailto:matt.fowles at gmail.com] 
Sent: Wednesday, March 30, 2011 8:57 AM
To: Sam Harwell
Cc: Martin d'Anjou; antlr-interest at antlr.org
Subject: Re: [antlr-interest] antlr v4 wish list

Sam~

A token needs to know both start and end position, especially when you add
in the restriction that *synthetic* tokens should respond with the positions
for the entire rule that created them (if they weren't based on another
token).  Basically, you need Tree and Token to always be able to provide
locations in the original stream (even if those locations are best guess)
regardless of how many tree transformations have taken place.  Whether it
internally uses a shared array of line offsets or stores duplicates in every
token, I don't care, but pushing all of that onto every language implementer
is not a good trade-off.

Matt

On Tue, Mar 29, 2011 at 11:29 PM, Sam Harwell <sharwell at pixelminegames.com>
wrote:
> Hi Martin,
>
> Replying to the individual points:
>
> 1. A token only needs to know the start position in the input stream 
> and the length. Considering a file may easily have hundreds of 
> thousands of tokens, it's very important to not add any information to 
> the token that can be efficiently derived in another manner, 
> especially if that information is infrequently used by applications. 
> For example, the line/column information can be efficiently derived if 
> the lexer maintains an internal array of line offsets (index 0 
> contains 0, the start position of line 0; index 1 contains the offset to
> the start of line 1; etc.).
>
> 3. The current notation is pretty simple once you see it. Also, it's 
> well documented in the books.
>
> 4. With proper integration into the build system, generated files 
> aren't checked into source control or distributed. The ANTLR project 
> itself generates V2 and V3 grammars, and my .NET projects generate V3 
> grammars (using my C# port of the Tool) at build time, so the 
> generated files never take up space in source control.
>
> Sam
>
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org 
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Martin d'Anjou
> Sent: Tuesday, March 29, 2011 9:33 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] antlr v4 wish list
>
> Hello,
>
> My suggestions, for what it's worth:
>
> 1) In the Runtime section:
> * Tokens and Trees should both know their start/stop line, start/stop 
> char position to make IDEs easier.
>
> Not only for IDEs, but also for debugging on the command line in a
> terminal. The file name is also needed.
>
> 2) Lexer debug enhancement:
> Option on the lexer constructor to have the lexer print some debug info:
> token type by name, token value, filename, line and char position, 
> without having to replace antlr's built-in classes.
>
> 3) General:
> I have spent many hours on a ridiculous little problem: the grammar 
> declaration statement! So I suggest enforcing the grammar type in the 
> grammar declaration:
> parser grammar MyGrammar;
> lexer grammar MyGrammar;
> mixed grammar MyGrammar;  // lexer and parser grammar
> tree grammar MyGrammar;
>
> 4) Gigantic source files, as described here:
> http://v2kparse.blogspot.com/2008/06/first-pass-uploaded-to-sourceforce.html
> Maybe this has been solved already?
>
> Regards,
> Martin d'Anjou
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>