[antlr-interest] Antlr 3 and the newline token problem

Sat Nov 26 05:07:03 PST 2005

Prashant Deva wrote:
> Terence and I were recently having a discussion about this.
> It's about how to handle newlines in antlr 3.

Can't you do it the way I intend to do it in C++? That is, plug in a
user-defined object that knows what kind of position tracking you want,
plus one that knows how to put that location info into a newly created
token?

In C++ I can now plug in an object that only tracks binary offsets in
the file, or instantiate the lexer with an object that tracks
everything, with no changes to the basic lexer code. On the test lexers
(one of them the Java lexer), switching from full tracking to minimalist
position tracking cut 10-11% off the binary size, just by changing a
template parameter.
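
A minimal sketch of the idea, not the actual ANTLR C++ code: the names
SketchStream, OffsetOnly, LineColumn and scanAll are hypothetical and
only illustrate how the tracker can be a template parameter while the
stream code stays the same. For brevity the LineColumn tracker treats
only '\n' as a line end.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tracker that records only the byte offset.
struct OffsetOnly {
    std::size_t offset = 0;
    void advance(char) { ++offset; }
};

// Hypothetical tracker that also records line and column.
// Simplification: only '\n' ends a line here.
struct LineColumn {
    std::size_t offset = 0;
    std::size_t line = 1;
    std::size_t column = 0;
    void advance(char c) {
        ++offset;
        if (c == '\n') { ++line; column = 0; }
        else           { ++column; }
    }
};

// The stream is written once; the tracker is a template parameter.
template <typename Position>
class SketchStream {
public:
    explicit SketchStream(std::string text) : text_(std::move(text)) {}
    bool atEnd() const { return pos_.offset >= text_.size(); }
    char consume() {
        char c = text_[pos_.offset];
        pos_.advance(c);   // the only place position logic is invoked
        return c;
    }
    const Position& position() const { return pos_; }
private:
    std::string text_;
    Position pos_;
};

// Drain a stream and return the final position.
template <typename Position>
Position scanAll(const std::string& text) {
    SketchStream<Position> s(text);
    while (!s.atEnd()) s.consume();
    return s.position();
}

// Usage: same input, two trackers, no change to SketchStream.
inline void demoTrackers() {
    assert(scanAll<OffsetOnly>("ab\ncd").offset == 5);
    LineColumn p = scanAll<LineColumn>("ab\ncd");
    assert(p.line == 2 && p.column == 2);
}
```

With OffsetOnly the newline test is simply not compiled into consume(),
which is the point of choosing the tracker at compile time.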

> Now as you probably know, ANTLR 2 currently can't handle all 3 types of
> newlines,
> i.e., if we have a rule like this:
> 
> WS : '\r' '\n' {newline();}
>        | '\r'    {newline();}
>        | '\n'   {newline();}
>      ;
> 
> we would get a nondeterminism warning.

A warning *shrug* ;)

> The reason this problem arises is solely because currently we have chosen to
> store 'lines & columns' in tokens instead of offsets.

It's more due to the choice to mark newlines this way, with action
code.

> I mean, think about it this way: if we didn't have to put in that newline()
> call, we could easily write this rule as-
> 
> WS : '\r' | '\n' ;
> 
> This would handle all 3 types of newlines.
> 
> So I propose that in antlr 3 you identify the position of the tokens by
> offset instead of 'line/columns'
> 
> This has the following advantages -
> 
> 
>    1. Some people may like line numbers to start from 1 while some may want
>    them to start from 0, but 'offsets' are universally considered to start from
>    0.
>    2. We need to keep track of just 1 value (the offset) instead of 2
>    values (row/column), so less complexity.
>    3. And of course we can handle all 3 types of 'newlines' easily :-)

And the people who want to track line/col/filename have to jump through
hoops to do that? This only moves the problem around. The real
problem is: how do we tell antlr what to track in the first place?
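
To make the hoops concrete: with offset-only tokens, line/column has to
be reconstructed after the fact, for instance from a table of line-start
offsets. The sketch below is one way that could look; lineStarts, locate
and LineCol are hypothetical names, and lines and columns are 0-based
here purely as an assumption of the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Offsets at which each line starts. "\r\n", a lone '\r' and a lone '\n'
// each count as exactly one line terminator.
std::vector<std::size_t> lineStarts(const std::string& text) {
    std::vector<std::size_t> starts{0};
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            // Treat "\r\n" as a single terminator by skipping the '\n'.
            if (i + 1 < text.size() && text[i + 1] == '\n') ++i;
            starts.push_back(i + 1);
        } else if (text[i] == '\n') {
            starts.push_back(i + 1);
        }
    }
    return starts;
}

struct LineCol {
    std::size_t line;    // 0-based in this sketch
    std::size_t column;  // 0-based in this sketch
};

// Map an offset to line/column with a binary search over the table.
LineCol locate(const std::vector<std::size_t>& starts, std::size_t offset) {
    auto it = std::upper_bound(starts.begin(), starts.end(), offset);
    std::size_t line = static_cast<std::size_t>(it - starts.begin()) - 1;
    return LineCol{line, offset - starts[line]};
}

// Usage: mixed newline styles in one input.
inline void demoLocate() {
    std::vector<std::size_t> starts = lineStarts("a\r\nb\rc\nd");
    LineCol d = locate(starts, 7);   // offset of 'd'
    assert(d.line == 3 && d.column == 0);
}
```

This works, but it needs a second pass over the input (or the table has
to be built while lexing), so the tracking work has not gone away.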

> On the other hand Terence suggests that call to newline() can be put inside
> the CharBuffer class where it is handled automatically so people who need to
> track line nos can do so easily.

I.e. put an extra if, plus logic for the different newline types (if
required), in a critical path of the lexer? This already happens in
antlr2 in some spots and it's ugly. Someone parsing binary stuff who
wants pure performance then pays for newline checks he doesn't need.
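
For illustration, this is roughly the newline logic in question, written
as a standalone tracker rather than inside the lexer or CharBuffer. It
is a sketch only: AnyNewlinePosition and track are hypothetical names,
and a lexer for binary input would simply use a tracker without these
branches.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical tracker that counts "\r\n", a lone '\r' and a lone '\n'
// as one line break each.
struct AnyNewlinePosition {
    std::size_t offset = 0;
    std::size_t line = 1;
    std::size_t column = 0;
    bool lastWasCR = false;

    void advance(char c) {
        ++offset;
        if (c == '\n' && lastWasCR) {
            // Second half of "\r\n": the line was already counted at '\r'.
            lastWasCR = false;
            return;
        }
        lastWasCR = (c == '\r');
        if (c == '\r' || c == '\n') { ++line; column = 0; }
        else                        { ++column; }
    }
};

// Feed a whole string through a tracker and return the final position.
template <typename Position>
Position track(const std::string& text) {
    Position p;
    for (char c : text) p.advance(c);
    return p;
}

// Usage: three newline styles, three line breaks.
inline void demoNewlines() {
    AnyNewlinePosition p = track<AnyNewlinePosition>("a\r\nb\rc\nd");
    assert(p.line == 4 && p.column == 1 && p.offset == 8);
}
```

Every character goes through two or three comparisons here, which is
the per-character cost a binary lexer should not have to pay.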

> This would be nice but then again it increases the complexity if we decide
> to keep both offsets and row/cols.
> 
> Which approach do you think would be best?

I'd be in favour of making the Java stuff do what I now do/intend in
C++, i.e. have some object that knows how to track the offset (or
whatever you want) in the stream, and combine it with a factory that
knows how to make the token from the position tracker and the start and
end position of the token.

That way you have a consistent and transparent way in which stream
locations are tracked and put into the tokens. With some extra tinkering
you can even reduce a token to just an int if you need to, or use
something more full-blown that tracks line/col/file.

So for C++ if I'd want full position tracking I'd say:

typedef CharStream<FullPosition> MyStream;

class MyTokenBuilder {
public:
   // Build a token spanning from 'start' to the stream's current offset.
   static TokenFullPos* build( token_type tp,
                               const FullPosition& start,
                               const MyStream& stream,
                               channel_type chan = DEFAULT_CHANNEL )
   {
      return new TokenFullPos( tp,
               stream.substring(start.getOffset(), stream.getOffset()),
               start.getLine(), start.getColumn(),
               stream.getFilename() );
   }
};
...
MyStream input( file, filename );
JavaParserLexer<MyStream,TokenFullPos,MyTokenBuilder> lexer( &input );
...
TokenFullPos* t = lexer.nextToken();

For minimalist tracking I'd just plug in a different token builder and a
different position tracker instead of FullPosition, and that's it. A
few typedefs at the top of the code hide all the template parameter
details for the most common cases.
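
As a rough, self-contained sketch of what the minimalist counterpart
might look like: everything below (OffsetPosition, SketchCharStream,
TokenOffsets, MinimalTokenBuilder) is a hypothetical stand-in and not
the real ANTLR C++ classes. The token is returned by value instead of
with new, which is a choice made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

typedef int token_type;   // assumption: token types are plain ints

// Hypothetical minimal position: just an offset.
struct OffsetPosition {
    std::size_t offset = 0;
    std::size_t getOffset() const { return offset; }
    void advance(char) { ++offset; }
};

// Hypothetical stand-in for the CharStream template in the post.
template <typename Position>
class SketchCharStream {
public:
    explicit SketchCharStream(std::string text) : text_(std::move(text)) {}
    const Position& position() const { return pos_; }
    std::size_t getOffset() const { return pos_.getOffset(); }
    void consume() {
        if (pos_.getOffset() < text_.size()) pos_.advance(text_[pos_.getOffset()]);
    }
private:
    std::string text_;
    Position pos_;
};

// Hypothetical minimal token: type plus start/end offsets, no text copy.
struct TokenOffsets {
    token_type type;
    std::size_t start;
    std::size_t end;
};

typedef SketchCharStream<OffsetPosition> MinimalStream;

// Same static build() shape as the full builder, but far less data.
struct MinimalTokenBuilder {
    static TokenOffsets build(token_type tp,
                              const OffsetPosition& start,
                              const MinimalStream& stream) {
        return TokenOffsets{tp, start.getOffset(), stream.getOffset()};
    }
};

// Usage: consume three characters and build a token covering them.
inline void demoMinimalToken() {
    MinimalStream in("int x");
    OffsetPosition start = in.position();
    in.consume(); in.consume(); in.consume();
    TokenOffsets t = MinimalTokenBuilder::build(1, start, in);
    assert(t.type == 1 && t.start == 0 && t.end == 3);
}
```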

Cheers,

Ric