[antlr-interest] Generated Java lexer not grokking Unicode

Gavin Lambert antlr at mirality.co.nz
Sat Nov 29 02:42:19 PST 2008


At 16:49 28/11/2008, Volker Stolz wrote:
 >AND   :   'AND' | '&&' | '\u00c3' '\u00b5'
 >
 >Note that I'm only using 'õ' as an arbitrary test character.
 >Eventually we'd like to be able to parse a whole range of
 >mostly mathematical characters.
[...]
 >it's UTF-8 in the form of 7-bit characters everywhere else
 >except for the byte sequence ... c3 b5 ... for the new character.
[...]
 >Any hints on how to get ANTLR to accept this? '\uc3b5' also doesn't
 >work.

C3 B5 is the two-byte UTF-8 encoding of that character -- it's not the
character itself.  (It may be worth reading up on how UTF-8 works, e.g.
on Wikipedia.)

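Lowercase Latin o with tilde (õ) is Unicode character U+00F5, i.e.
'\u00F5'.  By contrast, '\uc3b5' names a different character entirely
(U+C3B5, in the Hangul Syllables block), which is why it doesn't match
either.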
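
To illustrate the difference between the encoded bytes and the character, here is a minimal sketch using only the JDK.  The class and method names (Utf8Demo, decodeUtf8, codePointLabel) are made up for this example and have nothing to do with ANTLR itself.  It decodes the bytes C3 B5 once as UTF-8 and once as ISO-8859-1; the second case yields the two characters U+00C3 U+00B5 that the rule quoted above was trying to match.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Decode raw bytes as UTF-8 into a Java String.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render a char in the usual U+XXXX notation.
    static String codePointLabel(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        // The two bytes found in the input file.
        byte[] raw = { (byte) 0xC3, (byte) 0xB5 };

        // Decoded as UTF-8: a single character.
        String decoded = decodeUtf8(raw);
        System.out.println(decoded.length());                   // 1
        System.out.println(codePointLabel(decoded.charAt(0)));  // U+00F5

        // Decoded with the wrong charset: each byte becomes its own character.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                     // 2
        System.out.println(codePointLabel(wrong.charAt(0)));    // U+00C3
        System.out.println(codePointLabel(wrong.charAt(1)));    // U+00B5
    }
}
```

The lexer works on decoded characters, so the input also has to be read as UTF-8 for '\u00F5' to match.  If I remember correctly, ANTLR 3's ANTLRFileStream has a constructor that takes an encoding name for this purpose, but check the runtime's documentation rather than taking my word for it.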


