[antlr-interest] Python target v3 unicode problems
Benjamin Niemann
pink at odahoda.de
Fri Sep 14 06:32:56 PDT 2007
Gavin Lambert wrote:
> At 09:55 14/09/2007, Viðar Svansson wrote:
> >Here, the transform function scans, and parses the string. Before
> >I used ANTLRFileStream(path, encoding='utf-8'), it would fail on
> >the scanning. Now the lexer works but the test fails in the end
> >with this output:
> >
> >Failed example:
> > tree
> >Expected:
> > {u'author' : u'Viðar Svansson'}
> >Got:
> > {u'author': u'Vi\xc3\xb0ar Svansson'}
> >
> >I am not sure what is wrong here; I have never seen hex values inside
> >a unicode string. Any ideas?
>
> Well, I haven't actually looked it up to see if it matches, but C3 B0
> seems like a double-byte UTF-8 sequence to me,
That's correct.
> and thus exactly what it should be doing, since you told it to deal
> with UTF-8.
That's not correct. The input stream decodes the UTF-8 bytes of the input
file to unicode. Past that point, the generated lexer/parser code and the
runtime module do not do any encoding or decoding. Strings coming from ANTLR
(Token.text et al.) should always be unicode instances.
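
To illustrate the difference (a rough sketch only, not code from the ANTLR
runtime): the value in the failing doctest, u'Vi\xc3\xb0ar', is what you get
when the two UTF-8 bytes C3 B0 each become a separate character, as a Latin-1
decode would do, instead of being decoded together into the single character
U+00F0. The snippet uses only built-in string methods and is written so that
it runs on Python 3; the repair() helper is a hypothetical name. Where in the
pipeline the wrong decode happens is not established in this thread.

```python
# Bytes as they would be stored in a UTF-8 encoded input file.
raw = u"Vi\u00f0ar Svansson".encode("utf-8")

# Correct: the two bytes C3 B0 are decoded together into U+00F0.
decoded_ok = raw.decode("utf-8")

# Incorrect: every byte is mapped to its own character (Latin-1),
# which yields the 'Vi\xc3\xb0ar' value seen in the failing test.
decoded_bad = raw.decode("latin-1")

def repair(text):
    """Undo a Latin-1 misdecode of UTF-8 data (hypothetical helper)."""
    return text.encode("latin-1").decode("utf-8")

print(repr(decoded_ok))
print(repr(decoded_bad))
print(repair(decoded_bad) == decoded_ok)
```

A properly decoded string has one character for the eth, so it is one
character shorter than the misdecoded one.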
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/