[antlr-interest] misunderstanding channel HIDDEN

Thu Aug 27 14:59:33 PDT 2009

Ian Eyberg wrote:
[...]
> I set antlrinputstream to receive "UTF8"
> 
>   ANTLRInputStream input = new ANTLRInputStream(sin, "UTF-8");
> 
> and I rewrite my UTF-16 as UTF-8 if I find it
> in my incoming files before I parse it..
> 
>   try {
>     FileInputStream fis = new FileInputStream(args[args.length-1]);
>     byte[] contents = new byte[fis.available()];
>     fis.read(contents, 0, contents.length);
> 
>     if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {

You probably want to accept big-endian UTF-16 here as well (the check
above only matches FF FE, the little-endian BOM typically produced by
Windows systems), given that it's very easy to do:

      if ( (contents[0] == (byte)0xFF && contents[1] == (byte)0xFE) ||
           (contents[0] == (byte)0xFE && contents[1] == (byte)0xFF) ) {

The "UTF-16" encoding specified in the String constructor will use the
BOM to detect byte order, as documented at
<http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html>.
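
To illustrate the BOM handling, here is a minimal sketch (the class and
method names are mine, purely for illustration). It decodes the letter
'A' from both byte orders using the same "UTF-16" charset name. I've
used the Charset overload of the String constructor so that there is no
checked exception to handle; as far as I know it decodes the same way
as the charset-name overload you're using.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of ANTLR or the original program.
class Utf16BomDemo {
    // Decode with the generic "UTF-16" charset, which reads the BOM
    // to choose the byte order and does not include it in the result.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, Charset.forName("UTF-16"));
    }

    public static void main(String[] args) {
        byte[] littleEndian = { (byte)0xFF, (byte)0xFE, 0x41, 0x00 }; // BOM FF FE, then 'A'
        byte[] bigEndian    = { (byte)0xFE, (byte)0xFF, 0x00, 0x41 }; // BOM FE FF, then 'A'

        System.out.println(decodeUtf16(littleEndian)); // prints "A"
        System.out.println(decodeUtf16(bigEndian));    // prints "A"
    }
}
```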

>       String asString = new String(contents, "UTF-16");
>       byte[] newBytes = asString.getBytes("UTF8");

"UTF8" is accepted as an alias for "UTF-8", but I'd use the latter for
consistency.

>       FileOutputStream fos = new FileOutputStream(args[args.length-1]);
>       fos.write(newBytes);
>       fos.close();
>     }
> 
>     fis.close();
>     } catch(Exception e) {
>       e.printStackTrace();
>   }
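
Putting the above together, something like the following might do. It
is only a sketch under my own assumptions, not a drop-in replacement:
the class and method names are made up, and I've added a few things
that aren't in your code. It checks that the file has at least two
bytes before looking at the BOM, it reads in a loop rather than relying
on available() (which isn't guaranteed to return the file size, and
read() may return fewer bytes than requested), and it closes the input
before reopening the same file for writing.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper class; names are illustrative only.
class Utf16ToUtf8 {
    // True if the data starts with either UTF-16 BOM (FF FE or FE FF).
    static boolean hasUtf16Bom(byte[] b) {
        return b.length >= 2 &&
               ((b[0] == (byte)0xFF && b[1] == (byte)0xFE) ||
                (b[0] == (byte)0xFE && b[1] == (byte)0xFF));
    }

    // Re-encode as UTF-8 if a UTF-16 BOM is present; otherwise
    // return the same array unchanged.
    static byte[] toUtf8IfUtf16(byte[] contents) {
        if (!hasUtf16Bom(contents)) return contents;
        String asString = new String(contents, Charset.forName("UTF-16"));
        return asString.getBytes(Charset.forName("UTF-8"));
    }

    // Read the whole stream; a single read() call may return early.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Rewrite the file in place only when a conversion actually happened.
    static void rewriteInPlace(String path) throws IOException {
        FileInputStream fis = new FileInputStream(path);
        byte[] contents;
        try {
            contents = readFully(fis);
        } finally {
            fis.close();
        }
        byte[] converted = toUtf8IfUtf16(contents);
        if (converted != contents) {
            FileOutputStream fos = new FileOutputStream(path);
            try {
                fos.write(converted);
            } finally {
                fos.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            rewriteInPlace(args[args.length - 1]);
        }
    }
}
```

The finally blocks make sure the streams get closed even if a read or
write fails part-way through.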

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com