[antlr-interest] Lazy load of CommonTokenStream??

Kay Röpke kroepke at classdump.org
Mon Aug 18 04:11:53 PDT 2008


Hi!

On Aug 18, 2008, at 10:49 AM, Vitaliy wrote:

> BTW, could LA(1) fail somehow - if there is no next token,
> or if it goes beyond the last token, or something like that?


LA(1) will return Token.EOF (i.e. the EOF token type) if there's no  
next token.
That can happen if there are no tokens at all or you are actually at  
the end of the buffer.

Use LT(1) to get the actual Token object (in which case the above  
would be Token.EOF_TOKEN, the singleton EOF token).
LA internally calls LT and returns the token type.

Please note that LA(0) and LT(0) are undefined. LT(0) will return null  
and LA(0) will throw a NullPointerException. _Never_ call them with 0  
as the argument, simply because it has undefined behavior.
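
To make the behaviour above concrete, here is a minimal, self-contained
sketch. SketchToken and SketchTokenStream are hypothetical stand-ins written
for this illustration, not the ANTLR runtime classes, and the EOF value of -1
is an assumption. The sketch only mirrors what is described here: LT(k)
returns the EOF singleton past the end of the buffer, LT(0) returns null, and
LA(k) dereferences whatever LT(k) returned without checking it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; not the ANTLR runtime classes.
final class SketchToken {
    static final int EOF = -1; // assumed value of the EOF token type
    static final SketchToken EOF_TOKEN = new SketchToken(EOF);
    final int type;

    SketchToken(int type) { this.type = type; }
}

final class SketchTokenStream {
    private final List<SketchToken> tokens;
    private int p = 0; // index of the current token, i.e. LT(1)

    SketchTokenStream(SketchToken... tokens) { this.tokens = Arrays.asList(tokens); }

    void consume() { if (p < tokens.size()) p++; }

    // k > 0 looks ahead, k < 0 looks behind, k == 0 yields null.
    SketchToken LT(int k) {
        if (k == 0) return null;
        if (k < 0) return LB(-k);
        int i = p + k - 1;
        return i >= tokens.size() ? SketchToken.EOF_TOKEN : tokens.get(i);
    }

    // Look-behind: null when the request falls off the start of the buffer.
    private SketchToken LB(int k) {
        int i = p - k;
        return i < 0 ? null : tokens.get(i);
    }

    // No null check, so LA(0) throws a NullPointerException.
    int LA(int k) { return LT(k).type; }
}
```

With two tokens buffered, LA(3) in this sketch yields the EOF type and LT(3)
yields the EOF singleton, while LA(0) throws. An empty stream gives the EOF
type for LA(1) as well.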

For completeness' sake: Negative arguments are ok, they will trigger a  
call to the protected LB() (look-behind). That of course only works  
for TokenStreams that do buffering, and might lead to NPEs if used  
with LA(), because LA doesn't check the return value before trying to  
invoke getType() on it. LT(-k) returns null if you fall off the  
beginning of the token buffer.
The lack of checks in LA is most likely due to the performance impact  
that check would have: LA is called often, and it's seldom used  
outside of generated code. The only place you might have to use it is  
in predicates and that's most likely only in the lexer (operating on  
the character stream) to figure out which token to generate, often in  
relation to whitespace checks (e.g. MySQL has this weird notion of  
requiring function calls to have no whitespace in between the function  
name and the opening '(' to resolve ambiguities with keywords - it's  
configurable, though, making it even worse).
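
As a rough illustration of that kind of whitespace check, the following
self-contained sketch tests whether the character right after a function name
is '('. SketchCharStream and CallPredicateSketch are hypothetical names made
up for this example, not ANTLR classes, and the EOF value of -1 is an
assumption. The sketch only supports k >= 1, so the LA(0) and negative-k
cases are deliberately left out.

```java
// Hypothetical stand-in for a character stream; not an ANTLR class.
final class SketchCharStream {
    static final int EOF = -1; // assumed end-of-input marker
    private final String input;
    private int p = 0; // index of the next character, i.e. LA(1)

    SketchCharStream(String input) { this.input = input; }

    // Reposition the stream, e.g. to just past an identifier.
    void seek(int index) { p = index; }

    // Forward lookahead only; this sketch rejects k < 1 outright.
    int LA(int k) {
        if (k < 1) throw new IllegalArgumentException("sketch supports k >= 1 only");
        int i = p + k - 1;
        return i >= input.length() ? EOF : input.charAt(i);
    }
}

final class CallPredicateSketch {
    // True only if '(' follows immediately, with no whitespace in between.
    static boolean nextIsOpenParen(SketchCharStream in) {
        return in.LA(1) == '(';
    }
}
```

Positioned just past the name in "count(*)" the check succeeds. For
"count (*)" and for a bare "count" at end of input it fails, because LA(1) is
a space or the EOF marker respectively.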

Now when I look at the code, there might be another bug or two  
lurking. At least minor issues, if they are not bugs.
Let the stream be positioned on the first token.
1) For CharStreams, LA(-1) will return CharStream.EOF. I think that's  
at least inconsistent; it should return INVALID_CHAR (which doesn't  
exist right now), because it's not EOF, technically.
2) For TokenStreams, LA(-1) will throw a NullPointerException, because  
LB(1) returns null. To be consistent LB should return  
Token.INVALID_TOKEN, thus causing LA(-1) to return  
Token.INVALID_TOKEN_TYPE. That way there's no extra check and no  
exception being thrown, making all calls (except LA(0)) to those  
methods safe from an exception point of view.
3) really minor: the naming scheme for Token.EOF (an int) and  
Token.INVALID_TOKEN_TYPE (also an int) is slightly off, but it's  
probably WorldOfPain(tm) to change it, so let's not bother. :)
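
A minimal sketch of what point 2 would amount to, under stated assumptions:
SentinelToken and SentinelTokenStream are hypothetical classes written for
this illustration, the value 0 for INVALID_TOKEN_TYPE is assumed, and only
the negative-k path is modelled. This is not the current runtime behaviour,
just the proposed one: LB hands back a sentinel instead of null, so LA needs
no extra check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; not the ANTLR runtime classes.
final class SentinelToken {
    static final int INVALID_TOKEN_TYPE = 0; // assumed value
    static final SentinelToken INVALID_TOKEN = new SentinelToken(INVALID_TOKEN_TYPE);
    final int type;

    SentinelToken(int type) { this.type = type; }
}

final class SentinelTokenStream {
    private final List<SentinelToken> tokens;
    private int p = 0; // index of the current token

    SentinelTokenStream(SentinelToken... tokens) { this.tokens = Arrays.asList(tokens); }

    void consume() { if (p < tokens.size()) p++; }

    // Look behind k tokens (k >= 1); sentinel instead of null off the front.
    SentinelToken LB(int k) {
        int i = p - k;
        return i < 0 ? SentinelToken.INVALID_TOKEN : tokens.get(i);
    }

    // Only the look-behind case is covered, to keep the sketch short.
    int LA(int k) {
        if (k >= 0) throw new IllegalArgumentException("sketch covers k < 0 only");
        return LB(-k).type; // never null, so no NullPointerException here
    }
}
```

On a stream positioned on its first token, LA(-1) in this sketch returns the
invalid token type rather than throwing. After one consume() it returns the
type of the token just passed.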

Opinions?

cheers,
-k
-- 
Kay Röpke
http://classdump.org/







