[antlr-interest] Python target v3 unicode problems

Fri Sep 14 06:32:56 PDT 2007

Gavin Lambert wrote:

> At 09:55 14/09/2007, Viðar Svansson wrote:
>  >Here, the transform function scans, and parses the string. Before
>  >I used ANTLRFileStream(path, encoding='utf-8'), it would fail on
>  >the scanning. Now the lexer works but the test fails in the end
>  >with this output:
>  >
>  >Failed example:
>  >    tree
>  >Expected:
>  >    {u'author' : u'Viðar Svansson'}
>  >Got:
>  >    {u'author': u'Vi\xc3\xb0ar Svansson'}
>  >
>  >I am not sure what is wrong here, never seen hex values inside
>  >a unicode encoding string. Any ideas?
> 
> Well, I haven't actually looked it up to see if it matches, but
> C3 B0 seems like a double-byte UTF-8 sequence to me,

That's correct.
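
For anyone who wants to check the byte values themselves, here is a small
sketch. It uses only the built-in str/bytes methods and nothing from ANTLR;
the variable names are made up for illustration.

```python
# U+00F0 is LATIN SMALL LETTER ETH, the character in question.
eth = u"\u00f0"

# Encode to UTF-8 and collect the resulting byte values.
utf8_bytes = [b for b in bytearray(eth.encode("utf-8"))]

# Show the bytes as two-digit hex values.
print(["%02X" % b for b in utf8_bytes])
```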

> and thus exactly what it should be doing, since you told it to deal
> with UTF-8.

That's not correct. The input stream decodes UTF-8 from the input file to
unicode. Past that point the generated lexer/parser code and the runtime
module do not do any encoding/decoding. Strings coming from ANTLR
(Token.text et al.) should always be unicode instances.
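
To illustrate the difference, here is a minimal sketch in plain Python, with
no ANTLR involved. It is not a diagnosis of where the problem sits in your
setup; it only shows one way a unicode string with the value from your "Got:"
line can come about, namely when UTF-8 bytes are turned into unicode one byte
per character (as a Latin-1 decode does). The variable names are hypothetical.

```python
# The text as it should look once decoded (U+00F0 is the eth character).
expected = u"Vi\u00f0ar Svansson"

# The same text as UTF-8 bytes, i.e. what is stored in the input file.
raw = expected.encode("utf-8")

# Decoding once with the right codec gives the expected unicode string.
decoded_once = raw.decode("utf-8")

# Treating each byte as one character (Latin-1) leaves the two UTF-8
# bytes C3 B0 as two separate code points, U+00C3 and U+00B0.
byte_per_char = raw.decode("latin-1")

print(decoded_once == expected)
print(byte_per_char == u"Vi\xc3\xb0ar Svansson")
```

Both comparisons print True. A value like the second one suggests that
somewhere the bytes were not decoded as UTF-8, or that they were decoded and
then handled a second time.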

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/