[antlr-interest] Generated Java lexer not grokking Unicode
Volker Stolz
vs at iist.unu.edu
Thu Nov 27 19:49:29 PST 2008
Dear all,
I added some Unicode matching to a token in a working grammar, e.g.:
AND : 'AND' | '&&' | '\u00c3' '\u00b5'
Note that I'm only using 'õ' as an arbitrary test character. Eventually
we'd like to be able to parse a whole range of mostly mathematical
characters.
I changed a working test file to contain the character in the right
place, and saved it as UTF-8. hexdump -C on the input file confirms that
it's UTF-8: 7-bit characters everywhere except for the byte sequence
... c3 b5 ... for the new character.
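
For reference, the relationship between the two bytes in the hexdump and the character can be checked with the JDK alone. The following is only a sketch: the class name Utf8Decode and the helper describe are made up for illustration, and it assumes a Java 8 or later JDK (not the 1.5 JDK mentioned in this post). It decodes the bytes c3 b5 once as UTF-8 and once as ISO-8859-1 and lists the resulting code points.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Decode {
    // Decode raw bytes with the given charset and list the resulting
    // code points in "U+XXXX" form, separated by spaces.
    public static String describe(byte[] bytes, Charset cs) {
        String s = new String(bytes, cs);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two bytes reported by hexdump for the test character.
        byte[] fromHexdump = { (byte) 0xc3, (byte) 0xb5 };
        // prints "U+00F5": one character when the bytes are read as UTF-8
        System.out.println(describe(fromHexdump, StandardCharsets.UTF_8));
        // prints "U+00C3 U+00B5": two characters when read as ISO-8859-1
        System.out.println(describe(fromHexdump, StandardCharsets.ISO_8859_1));
    }
}
```

So a stream that decodes the file as UTF-8 hands over a single character U+00F5 at that position, whereas a byte-per-character decoding such as ISO-8859-1 yields the two characters U+00C3 and U+00B5.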
I use ANTLR 3.1 to generate the code for Java, and fire up a JUnit
testcase in Eclipse:
line 12:33 no viable alternative at character 'õ'
The OS is Mac OS with JDK 1.5. The default charset reported by
java.nio.charset is UTF-8 in Eclipse and MacRoman when running the test
from the command line. I tried instantiating both plain
ANTLRFileStream(filename) and ANTLRFileStream(filename,"utf-8"), with the
same result in both cases (I have to set utf-8 when invoking from the
command line anyway).
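
To see what the platform default is and how it changes what a reader gets out of the same file, a small JDK-only probe along these lines may help. It is a sketch under assumptions: CharsetProbe, readAs and writeSample are hypothetical names, it needs Java 7 or later for java.nio.file, and it does not involve ANTLRFileStream at all, so it says nothing about how ANTLR itself opens the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetProbe {
    // Read a whole file, decoding its bytes with the given charset.
    public static String readAs(Path file, Charset cs) {
        try {
            return new String(Files.readAllBytes(file), cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write sample bytes to a temporary file (fixture for this sketch only).
    public static Path writeSample(byte[] bytes) {
        try {
            Path p = Files.createTempFile("sample", ".txt");
            Files.write(p, bytes);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Differs between environments, e.g. an IDE versus a plain shell.
        System.out.println("default charset: " + Charset.defaultCharset());

        // 'A', then the UTF-8 bytes c3 b5, then 'B'.
        Path p = writeSample(new byte[] { 'A', (byte) 0xc3, (byte) 0xb5, 'B' });

        // Three characters when decoded explicitly as UTF-8.
        System.out.println("as UTF-8:   " + readAs(p, StandardCharsets.UTF_8).length() + " chars");
        // The count under the default charset depends on the platform.
        System.out.println("as default: " + readAs(p, Charset.defaultCharset()).length() + " chars");
    }
}
```

Running it once from Eclipse and once from the command line shows whether the two environments decode the same bytes differently.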
Any hints on how to get ANTLR to accept this? '\uc3b5' also doesn't work.
Thanks in advance,
Volker
--
United Nations University - http://rcos.iist.unu.edu/~vs/
International Institute for Software Technology Macau SAR, China