[antlr-interest] Generated Java lexer not grokking Unicode

Volker Stolz vs at iist.unu.edu
Thu Nov 27 19:49:29 PST 2008


Dear all, I added some Unicode matching to a token in a working 
grammar, e.g.:

AND   :   'AND' | '&&' | '\u00c3' '\u00b5'

Note that I'm only using 'õ' as an arbitrary test character. Eventually 
we'd like to be able to parse a whole range of mostly mathematical 
characters.

I changed a working test file to contain the character in the right 
place and saved it as UTF-8. hexdump -C on the input file confirms the 
encoding: every character is 7-bit ASCII except for the byte sequence 
... c3 b5 ... for the new character.
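
As a sanity check on what those two bytes become once decoded, here is 
a minimal Java sketch. It does not involve ANTLR at all; the class and 
method names are made up for illustration, and it assumes Java 7 or 
later for StandardCharsets. It only shows that the bytes c3 b5 decode to 
the single character U+00F5 under UTF-8, but to the two characters 
U+00C3 U+00B5 under a single-byte charset such as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class DecodeCheck {
    // The two bytes that hexdump shows for the test character.
    static final byte[] BYTES = { (byte) 0xC3, (byte) 0xB5 };

    // Decode the bytes as UTF-8: one character is expected.
    static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as ISO-8859-1: one character per byte.
    static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    // Render each UTF-16 code unit of s as U+XXXX, separated by spaces.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + hex(asUtf8()));   // U+00F5
        System.out.println("ISO-8859-1: " + hex(asLatin1())); // U+00C3 U+00B5
    }
}
```

If the lexer sees decoded characters rather than raw bytes, the first 
line is what a token literal would have to match for UTF-8 input.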

I use ANTLR 3.1 to generate the Java code and run a JUnit test case in 
Eclipse, which reports:

line 12:33 no viable alternative at character 'õ'

The OS is Mac OS with JDK 1.5. The default charset reported by 
java.nio.charset is UTF-8 in Eclipse and MacRoman when running the test 
from the command line. I tried instantiating both plain 
ANTLRFileStream(filename) and ANTLRFileStream(filename,"utf-8"), with 
the same result in both cases (I have to set utf-8 when invoking from 
the command line anyway).
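
To illustrate why the default charset matters when no encoding is 
passed, the following sketch reads the same bytes with an explicit 
charset and with the platform default. It uses only the JDK (Java 7 or 
later for java.nio.file), not ANTLRFileStream; the class name, helper 
methods and sample content are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetRead {
    // Write a small sample file: "a ", the bytes c3 b5, then " b".
    static Path writeSample() {
        try {
            Path tmp = Files.createTempFile("unicode-test", ".txt");
            byte[] content = {
                (byte) 'a', (byte) ' ', (byte) 0xC3, (byte) 0xB5, (byte) ' ', (byte) 'b'
            };
            Files.write(tmp, content);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the whole file and decode it with the given charset.
    static String read(Path file, Charset cs) {
        try {
            return new String(Files.readAllBytes(file), cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = writeSample();
        try {
            Charset dflt = Charset.defaultCharset();
            System.out.println("default charset:     " + dflt);
            System.out.println("length with default: " + read(tmp, dflt).length());
            System.out.println("length with UTF-8:   "
                    + read(tmp, StandardCharsets.UTF_8).length());
        } finally {
            Files.delete(tmp);
        }
    }
}
```

With UTF-8 the six bytes decode to five characters; a single-byte 
default charset would yield six, so the lexer would see two characters 
where the file has one.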

Any hints on how to get ANTLR to accept this? '\uc3b5' also doesn't work.
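
For reference, a small Java sketch of what the escape '\uc3b5' denotes 
on its own terms: a four-digit escape names one code point, U+C3B5, 
rather than the byte pair c3 b5. The class and method names are 
illustrative only, and whether ANTLR reads the escape the same way as 
Java is an assumption here.

```java
public class EscapeCheck {
    // True only if the two escapes denote the same character.
    static boolean sameCharacter() {
        return '\uc3b5' == '\u00f5';
    }

    // Unicode block that a given character belongs to.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        System.out.println("same character: " + sameCharacter());
        System.out.println("U+C3B5 block:   " + blockOf('\uc3b5'));
        System.out.println("U+00F5 block:   " + blockOf('\u00f5'));
    }
}
```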

Thanks in advance,
  Volker

-- 
United Nations University -                       http://rcos.iist.unu.edu/~vs/
International Institute for Software Technology   Macau SAR, China


