[antlr-interest] Python target v3 unicode problems

Thu Sep 13 13:50:20 PDT 2007

Hi Viðar,

Sorry for the late reply; gmane ate my post and I didn't notice until now.

Viðar Svansson wrote:

> I am trying to scan a UTF-8 file using the python target but with no luck.
> I usually get this error:
> 
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 0: ordinal not in range(128)
> 
> I have tried to cast the strings to unicode everywhere but nothing
> seems to work. I have also tried many different declarations of the
> unicode tokens but nothing seems to work there either.
> 
> I found reference to something like this:
> 
> class L extends Lexer;
> options {
> charVocabulary = '\3'..'\377' | '\u1000'..'\u1fff';
> }
> 
> But I think this is v2 syntax, correct?
> Does anyone have a working unicode lexer in python?

You must feed unicode data into the lexer. The error above means a byte
string got in somewhere and Python tried to decode it implicitly as ASCII.
So if you are using ANTLRFileStream, use something like
  ANTLRFileStream(path, encoding='utf-8')
No vocabulary has to be declared; it's always full unicode (well, almost: I
think ANTLR is limited to 16 bit, and I have never tried what happens if you
feed it a code point beyond U+FFFF).
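
If you read the file yourself instead, the same idea applies: decode the
bytes before the lexer ever sees them. Here is a minimal sketch of that using
only the standard library. The helper names read_unicode and beyond_bmp are
made up for this example and are not part of the ANTLR runtime, and the
U+FFFF check only reflects my guess above about the 16 bit limit.

```python
import io
import os
import tempfile


def read_unicode(path, encoding="utf-8"):
    """Read a file and return decoded text (unicode), not raw bytes."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def beyond_bmp(text):
    """Return the characters in text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Write a small UTF-8 sample file so the sketch is self-contained.
sample = u"Vi\u00f0ar"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with io.open(fd, "wb") as handle:
    handle.write(sample.encode("utf-8"))

# Decode explicitly; this is the step that avoids the 'ascii' codec error.
text = read_unicode(sample_path)
os.remove(sample_path)

# Characters outside the 16 bit range, if any, are worth checking separately.
suspicious = beyond_bmp(text)
```

The decoded text is what you would then hand to a string-based input stream
such as ANTLRStringStream, instead of the raw bytes.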

HTH

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/