[antlr-interest] Lazy load of CommonTokenStream??

Mon Aug 18 05:43:51 PDT 2008

Thanks for the detailed explanation!

IMHO ANTLR rocks! It's a great tool with so much power to it, and it's wonderful that it's an open source project.
I just wish it were a little better documented.

Anyway,
Thanks again for all your help,
Vitaliy

-----Original Message-----
From: Kay Röpke [mailto:kroepke at classdump.org]
Sent: Monday, August 18, 2008 13:12
To: Vitaliy
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Lazy load of CommonTokenStream??

Hi!

On Aug 18, 2008, at 10:49 AM, Vitaliy wrote:

> BTW, could LA(1) fail somehow - if there is no next token,
> or if it goes beyond the last token, or something like that?

LA(1) will return Token.EOF (i.e. the EOF token type) if there's no
next token.
That can happen if there are no tokens at all or you are actually at
the end of the buffer.

Use LT(1) to get the actual Token object (in which case the above
would be Token.EOF_TOKEN, the singleton EOF token).
LA internally calls LT and returns the token type.

Please note that LA(0) and LT(0) are undefined. LT(0) will return null
and LA(0) will throw a NullPointerException. _Never_ call them with 0
as the argument, simply because it has undefined behavior.
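
To make the forward-lookahead behaviour above concrete, here is a minimal
sketch in plain Java. It is a toy model, not the ANTLR runtime: the names
LookaheadSketch, Tok and ToyTokenStream are made up for illustration, and
the numeric value used for EOF is an assumption of the sketch. It only
models k >= 1 and k == 0.

```java
import java.util.Arrays;
import java.util.List;

public class LookaheadSketch {
    // Sketch value playing the role of Token.EOF (assumed, not taken from the runtime).
    static final int EOF = -1;

    /** Hypothetical stand-in for a token: just a type and its text. */
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Singleton playing the role of Token.EOF_TOKEN.
    static final Tok EOF_TOKEN = new Tok(EOF, "<EOF>");

    /** Toy buffered token stream; look-behind is deliberately not modelled here. */
    static final class ToyTokenStream {
        private final List<Tok> tokens;
        private int p = 0; // index of the current token, i.e. LT(1)

        ToyTokenStream(List<Tok> tokens) { this.tokens = tokens; }

        void consume() { if (p < tokens.size()) p++; }

        Tok LT(int k) {
            if (k < 0) throw new IllegalArgumentException("look-behind not modelled in this sketch");
            if (k == 0) return null;                 // mirrors "LT(0) will return null"
            int i = p + k - 1;
            return i < tokens.size() ? tokens.get(i) : EOF_TOKEN;
        }

        // No null check on purpose: LA(0) therefore throws a NullPointerException.
        int LA(int k) { return LT(k).type; }
    }

    public static void main(String[] args) {
        ToyTokenStream s = new ToyTokenStream(Arrays.asList(new Tok(4, "a"), new Tok(5, "b")));
        System.out.println(s.LA(1));                 // type of the first token
        s.consume();
        s.consume();                                 // now past the last token
        System.out.println(s.LA(1) == EOF);          // end of buffer yields the EOF type
        System.out.println(s.LT(1) == EOF_TOKEN);    // and the singleton EOF token
        System.out.println(s.LT(0) == null);         // k == 0 is the case to avoid
    }
}
```

In real code the practical takeaway is the same as in the text: check for
the EOF type after LA(1) and never pass 0.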

For completeness' sake: Negative arguments are ok, they will trigger a
call to the protected LB() (look-behind). That of course only works
for TokenStreams that do buffering, and might lead to NPEs if used
with LA(), because LA doesn't check the return value before trying to
invoke getType() on it. LT(-k) returns null if you fall off the
beginning of the token buffer.
The lack of checks in LA is most likely due to the performance impact
that check would have: LA is called often, and it's seldom used
outside of generated code. The only place you might have to use it is
in predicates and that's most likely only in the lexer (operating on
the character stream) to figure out which token to generate, often in
relation to whitespace checks (e.g. MySQL has this weird notion of
requiring function calls to have no whitespace in between the function
name and the opening '(' to resolve ambiguities with keywords - it's
configurable, though, making it even worse).
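
As a rough illustration of the kind of whitespace check mentioned above,
here is a sketch in plain Java of what such a lexer predicate boils down
to: peek at the next character with LA(1) and see whether it is '('. The
classes FunctionCallCheck and ToyCharStream and the method isFunctionCall
are hypothetical, and the EOF value is an assumption of the sketch; in a
real grammar the test would sit in a predicate operating on the lexer's
character stream.

```java
public class FunctionCallCheck {
    // Sketch value playing the role of CharStream.EOF (assumed).
    static final int EOF = -1;

    /** Toy character stream; only forward lookahead (k >= 1) is modelled. */
    static final class ToyCharStream {
        private final String data;
        private final int p; // index of the next unread character, i.e. LA(1)

        ToyCharStream(String data, int p) { this.data = data; this.p = p; }

        int LA(int k) {
            if (k < 1) throw new IllegalArgumentException("only k >= 1 is modelled in this sketch");
            int i = p + k - 1;
            return i < data.length() ? data.charAt(i) : EOF;
        }
    }

    /** True if the character right after the name just matched is '(' with no whitespace in between. */
    static boolean isFunctionCall(ToyCharStream in) {
        return in.LA(1) == '(';
    }

    public static void main(String[] args) {
        int afterName = "COUNT".length(); // stream positioned just past the identifier
        System.out.println(isFunctionCall(new ToyCharStream("COUNT(*)", afterName)));
        System.out.println(isFunctionCall(new ToyCharStream("COUNT (*)", afterName)));
        System.out.println(isFunctionCall(new ToyCharStream("COUNT", afterName)));
    }
}
```

Because running off the end yields EOF rather than an exception, the check
needs no special case for an identifier at the very end of the input.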

Now when I look at the code, there might be another bug or two
lurking. At least minor issues, if they are not bugs.
Let the stream be positioned on the first token.
1) For CharStreams, LA(-1) will return CharStream.EOF. I think that's
at least inconsistent; it should return INVALID_CHAR (which doesn't
exist right now), because technically it's not EOF.
2) For TokenStreams, LA(-1) will throw a NullPointerException, because
LB(1) returns null. To be consistent LB should return
Token.INVALID_TOKEN, thus causing LA(-1) to return
Token.INVALID_TOKEN_TYPE. That way there's no extra check and no
exception being thrown, making all calls (except LA(0)) to those
methods safe from an exception point of view.
3) really minor: the naming scheme for Token.EOF (an int) and
Token.INVALID_TOKEN_TYPE (also an int) is slightly off, but it's
probably WorldOfPain(tm) to change it, so let's not bother. :)
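
Point 2 can be sketched in plain Java as follows. This is a toy model
under stated assumptions, not the runtime's code: SentinelLookBehind, Tok,
ToyTokenStream, the useSentinel switch and the throwsNpe helper are all
made up for the example, and the numeric values of the constants are
assumed. With the switch off, LB returns null and LA(-1) blows up as
described; with it on, LB returns a sentinel token and LA(-1) simply
yields the invalid token type.

```java
import java.util.Arrays;
import java.util.List;

public class SentinelLookBehind {
    // Sketch values playing the roles of Token.EOF and Token.INVALID_TOKEN_TYPE (assumed).
    static final int EOF = -1;
    static final int INVALID_TOKEN_TYPE = 0;

    /** Hypothetical stand-in for a token: just a type and its text. */
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    static final Tok EOF_TOKEN = new Tok(EOF, "<EOF>");
    static final Tok INVALID_TOKEN = new Tok(INVALID_TOKEN_TYPE, "<invalid>");

    static final class ToyTokenStream {
        private final List<Tok> tokens;
        private final boolean useSentinel; // false: behaviour as described; true: the proposal
        private int p = 0;                 // index of the current token, i.e. LT(1)

        ToyTokenStream(List<Tok> tokens, boolean useSentinel) {
            this.tokens = tokens;
            this.useSentinel = useSentinel;
        }

        void consume() { if (p < tokens.size()) p++; }

        /** Look behind k tokens; falls off the start of the buffer when p - k < 0. */
        Tok LB(int k) {
            int i = p - k;
            if (i >= 0) return tokens.get(i);
            return useSentinel ? INVALID_TOKEN : null;
        }

        Tok LT(int k) {
            if (k == 0) return null;       // still undefined, as in the text
            if (k < 0) return LB(-k);
            int i = p + k - 1;
            return i < tokens.size() ? tokens.get(i) : EOF_TOKEN;
        }

        int LA(int k) { return LT(k).type; } // no null check in either variant
    }

    /** Helper for the demo: does LA(k) throw a NullPointerException? */
    static boolean throwsNpe(ToyTokenStream s, int k) {
        try { s.LA(k); return false; } catch (NullPointerException e) { return true; }
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(new Tok(4, "a"), new Tok(5, "b"));
        // Both streams are positioned on the first token.
        System.out.println(throwsNpe(new ToyTokenStream(toks, false), -1));
        System.out.println(new ToyTokenStream(toks, true).LA(-1) == INVALID_TOKEN_TYPE);
    }
}
```

Note that LA itself is unchanged between the two variants; only the value
LB hands back differs, which is why the proposal costs no extra check.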

Opinions?

cheers,
-k
--
Kay Röpke
http://classdump.org/

__________ Information from ESET NOD32 Antivirus, version of virus signature database 3364 (20080818) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com
