[antlr-interest] Bug in Python target while using multiple lexers + island grammar

Bob Adolf rdadolf at gmail.com
Tue Mar 30 15:30:06 PDT 2010


I probably should've posted this a while ago, but I didn't get around  
to it.

There is a bug in the CommonTokenStream.getTokens() function in the
Python target version 3.1.2 (it looks like there is a 3.2 version in
the bug database, but 3.1.2 is the released Python runtime, and the
version shouldn't affect the bug anyway). The slice that selects
which tokens to return is:
	self.tokens[start:stop]
which drops the last token, since a Python slice excludes its end
index. My guess is that in normal cases this goes unnoticed because
the last token is EOF, and if you're calling getTokens() after the
fact, EOF has already served its primary purpose and terminated the
lexer. A cursory look at the Java code makes me think that it doesn't
have this problem, but I have not tested it. A port of the included
reproducer could answer that.

The workaround is to use the tokens list inside the CommonTokenStream  
class directly instead of getTokens().
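
To make the off-by-one concrete, here is a minimal sketch using a plain Python list in place of the token buffer. The function names and the string "tokens" are made up for illustration, and the index clamping is an assumption about how an inclusive stop index would be handled; this is not the runtime's actual code.

```python
def get_tokens_exclusive(tokens, start=None, stop=None):
    """Mimics the reported behaviour: 'stop' is meant to be the index of
    the last wanted token, but the slice treats it as exclusive."""
    if stop is None or stop >= len(tokens):
        stop = len(tokens) - 1      # clamp to the last valid index (assumed)
    if start is None or start < 0:
        start = 0
    return tokens[start:stop]       # drops tokens[stop]


def get_tokens_inclusive(tokens, start=None, stop=None):
    """Same clamping, but the slice end is bumped by one so that
    tokens[stop] is included."""
    if stop is None or stop >= len(tokens):
        stop = len(tokens) - 1
    if start is None or start < 0:
        start = 0
    return tokens[start:stop + 1]


# Hypothetical token buffer; real entries would be token objects.
buffer = ["ID", "WS", "INT", "EOF"]

print(get_tokens_exclusive(buffer))   # last entry is missing
print(get_tokens_inclusive(buffer))   # full buffer
```

Reading the list directly, as in the workaround above, avoids the slice altogether.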

On a side note, there is also a bug (sort of) in the code given on the
wiki page for emitting multiple tokens
(http://www.antlr.org/wiki/pages/viewpage.action?pageId=3604497).
The method proposed builds up an array of tokens and then emits
them one by one as it continues to go through the file. This is fine  
for non-island grammars, but if you use multiple-emit inside an island  
grammar, the lexer will happily continue munching input as it cleans  
out its emit buffer even after the EOF token is "emitted". This either  
leads to the island lexer throwing away input (since it terminates on  
EOF and tosses the remaining multi-emit buffer) or throwing an error  
(if it runs across input that it cannot understand).
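
The over-consumption can be sketched with a toy lexer that has nothing to do with the antlr3 API. Everything here (ToyIslandLexer, lex_one_rule, the guard flag, the "}" delimiter) is hypothetical and only models the idea: with the multi-emit approach, each call lexes one more rule and then pops one token from the buffer. The guard shown is one possible way to stop lexing once EOF has been queued; it is an assumption, not necessarily what the attached reproducer does.

```python
EOF = "EOF"


class ToyIslandLexer:
    """Toy model of a lexer with a multi-token emit buffer."""

    def __init__(self, text, guard):
        self.text = text
        self.pos = 0            # how much input has been consumed
        self.queue = []         # the multi-emit buffer
        self.saw_eof = False    # set once EOF has been queued
        self.guard = guard      # True: stop lexing after EOF is queued

    def lex_one_rule(self):
        # One character per "rule". '}' closes the island and emits
        # two tokens at once: CLOSE followed by EOF.
        if self.pos >= len(self.text):
            self.queue.append(EOF)
            self.saw_eof = True
            return
        ch = self.text[self.pos]
        self.pos += 1
        if ch == "}":
            self.queue.extend(["CLOSE", EOF])
            self.saw_eof = True
        else:
            self.queue.append("CHAR:" + ch)

    def next_token(self):
        if self.guard and self.saw_eof:
            # Drain the buffer without touching any more input.
            return self.queue.pop(0) if self.queue else EOF
        if not (self.guard and self.queue):
            # Unguarded: always lex another rule before popping.
            self.lex_one_rule()
        return self.queue.pop(0)


def run_island(text, guard):
    """Pull tokens until EOF; report them and how much input was eaten."""
    lexer = ToyIslandLexer(text, guard)
    tokens = []
    while True:
        tok = lexer.next_token()
        tokens.append(tok)
        if tok == EOF:
            return tokens, lexer.pos


# The island is "a}"; "xyz" belongs to the outer lexer.
print(run_island("a}xyz", guard=False))   # consumes one char past the island
print(run_island("a}xyz", guard=True))    # stops right after the '}'
```

In the unguarded run the token for "x" is left sitting in the buffer when the island lexer terminates, which is the "throwing away input" case; input the island lexer cannot match at that point would instead raise an error.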

I've included a reproducer (in Python) which demonstrates both bugs
and gives a workaround for each.

Thanks,

   -Bob



-------------- next part --------------
A non-text attachment was scrubbed...
Name: BUG_python_island_grammar.tgz
Type: application/octet-stream
Size: 1918 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20100330/f9738592/attachment.obj 