[antlr-interest] Python target v3 unicode problems
Benjamin Niemann
pink at odahoda.de
Thu Sep 13 13:50:20 PDT 2007
Hi Viðar,
sorry for the late reply, gmane ate my post and I didn't notice until now.
Viðar Svansson wrote:
> I am trying to scan a UTF-8 file using the python target but with no luck.
> I usually get this error:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>
> I have tried to cast the strings to unicode everywhere but nothing
> seems to work. I have also tried many different declarations of the
> unicode tokens but nothing seems to work there either.
>
> I found reference to something like this:
>
> class L extends Lexer;
> options {
> charVocabulary = '\3'..'\377' | '\u1000'..'\u1fff';
> }
>
> But I think this is v2 syntax, correct?
> Does anyone have a working unicode lexer in python?
You must feed unicode data into the lexer, not raw byte strings. So if you are
using ANTLRFileStream, pass the encoding explicitly, something like
    ANTLRFileStream(path, encoding='utf-8')
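To illustrate the decoding point, here is a minimal sketch that does not touch the ANTLR runtime at all. It reproduces the error from the question (an ASCII decode of UTF-8 bytes) and shows the explicit decode that avoids it. The helper names `write_sample`, `read_unicode` and `ascii_decode_error` are made up for this example; the string returned by `read_unicode` is the kind of unicode data the lexer's input stream should be given.

```python
import io

# Sample input that starts with a non-ASCII character (U+00F0, "eth").
# Encoded as UTF-8, its first byte is 0xc3.
SAMPLE = u"\u00f0 = 1\n"


def write_sample(path):
    """Write SAMPLE to path as raw UTF-8 bytes."""
    with io.open(path, "wb") as handle:
        handle.write(SAMPLE.encode("utf-8"))


def read_unicode(path, encoding="utf-8"):
    """Read path and return its contents decoded to a unicode string."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def ascii_decode_error(path):
    """Decode the raw bytes as ASCII and return the error message, if any."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None
```

Decoding once, at the point where the file is read, keeps byte strings out of the rest of the pipeline, so nothing downstream falls back to an implicit ASCII conversion.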
No vocabulary has to be declared, it's always full unicode (erm.. almost: I
think ANTLR is limited to 16 bits, and I have never tried what happens if you
feed it a code point beyond U+FFFF).
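If you want to know whether your input would run into that suspected 16-bit limit, you can check it before lexing. This is only a sketch: `has_non_bmp` is a hypothetical helper, not part of the ANTLR runtime, and it also treats lone surrogates as a hit because some Python builds store characters above U+FFFF as surrogate pairs.

```python
def has_non_bmp(text):
    """Return True if text contains a code point above U+FFFF.

    Surrogate code units (U+D800..U+DFFF) count as well, since builds
    that store such characters as surrogate pairs expose them that way.
    """
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF or 0xD800 <= code <= 0xDFFF:
            return True
    return False
```

Running this over the decoded text tells you whether the untested case applies to your files at all.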
HTH
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/