[antlr-interest] Generated Java lexer not grokking Unicode
Gavin Lambert
antlr at mirality.co.nz
Sat Nov 29 02:42:19 PST 2008
At 16:49 28/11/2008, Volker Stolz wrote:
>AND : 'AND' | '&&' | '\u00c3' '\u00b5'
>
>Note that I'm only using 'õ' as an arbitrary test character.
>Eventually we'd like to be able to parse a whole range of
>mostly mathematical characters.
[...]
>it's UTF-8 in the form of 7-bit characters everywhere else
>except for the byte sequence ... c3 b5 ... for the new character.
[...]
>Any hints on how to get ANTLR to accept this? '\uc3b5' also doesn't
>work.
C3 B5 is the two-byte UTF-8 encoding of that character -- it's not
the character itself. (Maybe you should look up how UTF-8 works,
e.g. on Wikipedia.)

Lowercase Latin o with tilde (õ) is Unicode character U+00F5, i.e.
'\u00F5'. That is the literal the rule needs; '\u00c3' '\u00b5' and
'\uc3b5' both name different characters.
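
To make the difference between the character and its encoding concrete, here is a minimal Java sketch. It is not from the original thread, and the class and method names are made up for illustration. It encodes U+00F5 as UTF-8, then decodes the bytes C3 B5 twice: once as UTF-8 and once as ISO-8859-1. The second decoding is an assumption about how the two-character literal in the grammar came about, since reading those bytes one per character gives exactly U+00C3 U+00B5.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Encode the string as UTF-8 and return the bytes as lowercase hex.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode raw bytes with the given charset and list the resulting
    // UTF-16 code units in U+XXXX notation.
    public static String codePoints(byte[] bytes, Charset cs) {
        String s = new String(bytes, cs);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC3, (byte) 0xB5 };

        // The character itself, encoded to bytes.
        System.out.println(utf8Hex("\u00F5"));                            // c3 b5

        // The same two bytes, decoded correctly and incorrectly.
        System.out.println(codePoints(raw, StandardCharsets.UTF_8));      // U+00F5
        System.out.println(codePoints(raw, StandardCharsets.ISO_8859_1)); // U+00C3 U+00B5
    }
}
```

If the input really is UTF-8, the lexer will presumably only see U+00F5 when the character stream is built with that encoding specified; otherwise the platform default charset applies and the result may differ.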