[antlr-interest] misunderstanding channel HIDDEN
David-Sarah Hopwood
david-sarah at jacaranda.org
Thu Aug 27 14:59:33 PDT 2009
Ian Eyberg wrote:
[...]
> I set antlrinputstream to receive "UTF8"
>
> ANTLRInputStream input = new ANTLRInputStream(sin, "UTF-8");
>
> and I rewrite my UTF-16 as UTF-8 if I find it
> in my incoming files before I parse it..
>
> try {
> FileInputStream fis = new FileInputStream(args[args.length-1]);
> byte[] contents = new byte[fis.available()];
> fis.read(contents, 0, contents.length);
>
> if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
You probably want to accept big-endian UTF-16 here as well (the check above
only matches the little-endian BOM, FF FE, which is what Windows systems
typically produce), given that it's very easy to do:
  if ( (contents[0] == (byte)0xFF && contents[1] == (byte)0xFE) ||
       (contents[0] == (byte)0xFE && contents[1] == (byte)0xFF) ) {
The "UTF-16" charset passed to the String constructor uses the BOM to
detect byte order when decoding (and defaults to big-endian if there is
no BOM), as documented at
<http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html>.
> String asString = new String(contents, "UTF-16");
> byte[] newBytes = asString.getBytes("UTF8");
"UTF8" is accepted as an alias for "UTF-8", but I'd use the latter for
consistency.
> FileOutputStream fos = new FileOutputStream(args[args.length-1]);
> fos.write(newBytes);
> fos.close();
> }
>
> fis.close();
> } catch(Exception e) {
> e.printStackTrace();
> }
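
Putting the two suggestions together, here is a minimal sketch of the BOM
check and re-encoding step. It is only an illustration, not a drop-in
replacement for the code above: the class name BomRecode and the helper
names hasUtf16Bom and toUtf8 are made up for this example, and it assumes
Java 7 or later for java.nio.charset.StandardCharsets (on Java 6, the
string names "UTF-16" and "UTF-8" would be used instead).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class BomRecode {
    /** Returns true if the buffer starts with either UTF-16 byte order mark. */
    static boolean hasUtf16Bom(byte[] b) {
        if (b.length < 2) {
            return false; // too short to hold a BOM
        }
        boolean littleEndian = b[0] == (byte) 0xFF && b[1] == (byte) 0xFE;
        boolean bigEndian = b[0] == (byte) 0xFE && b[1] == (byte) 0xFF;
        return littleEndian || bigEndian;
    }

    /** Re-encodes BOM-prefixed UTF-16 as UTF-8; other input is returned unchanged. */
    static byte[] toUtf8(byte[] contents) {
        if (!hasUtf16Bom(contents)) {
            return contents;
        }
        // The UTF-16 charset reads the BOM to pick the byte order and drops it.
        String asString = new String(contents, StandardCharsets.UTF_16);
        return asString.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        byte[] be = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};
        System.out.println(new String(toUtf8(le), StandardCharsets.UTF_8));
        System.out.println(new String(toUtf8(be), StandardCharsets.UTF_8));
    }
}
```

Both sample buffers decode to the same two-character string, so the
caller does not need to care which byte order the file was written in.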
--
David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com