[antlr-interest] Python target v3 unicode problems

Fri Sep 14 00:31:24 PDT 2007

Viðar Svansson wrote:
> I tried this, and now I can load the strings successfully, thanks.
> However, they seem to be somehow wrong after the parse. Here is my doctest:
> 
>     >>> unicode_str = u'author : "Viðar Svansson" ; '
>     >>> tree = transform(unicode_str,SymbolTable(), Decorator, Linker)
>     >>> tree
>     {u'author' : u'Viðar Svansson'}
> 
> Here, the transform function scans and parses the string. Before I
> used ANTLRFileStream(path, encoding='utf-8'), it would fail during
> scanning. Now the lexer works, but the test fails in the end with this
> output:
> 
> Failed example:
>     tree
> Expected:
>     {u'author' : u'Viðar Svansson'}
> Got:
>     {u'author': u'Vi\xc3\xb0ar Svansson'}
> 
> I am not sure what is wrong here; I have never seen hex values inside a
> unicode string. Any ideas?

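Just a wild guess. Unicode can represent the same text in more than one
way (for example, an accented letter as a single precomposed code point,
or as a base letter followed by a combining mark), so two unicode strings
representing the same sequence of characters could potentially contain
different sequences of code points, and therefore different bytes.

You can try to normalize both strings to the same form before comparing
them, for example with Python's unicodedata.normalize function.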
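
A minimal sketch of that suggestion, assuming the difference really is one
of equivalent Unicode forms. The helper name same_text and the sample
strings are made up for illustration; only unicodedata.normalize comes
from the standard library. The sample uses "é" because it has both a
precomposed and a decomposed spelling.

```python
import unicodedata

# Two spellings of the same visible text "é" (illustrative sample data):
composed = u'\u00e9'      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = u'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT


def same_text(a, b, form='NFC'):
    """Hypothetical helper: compare two unicode strings after
    normalizing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A plain comparison sees different code points ...
raw_equal = (composed == decomposed)

# ... while the normalized comparison treats them as the same text.
normalized_equal = same_text(composed, decomposed)
```

If two strings still differ after normalization, the difference is not one
of equivalent forms, and the cause lies elsewhere, for instance in how the
input was decoded.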