[antlr-interest] misunderstanding channel HIDDEN

Ian Eyberg ian at telematter.com
Thu Aug 27 12:43:36 PDT 2009


Thanks everyone,

It took a bit of reading and a lot of cursing, but I
eventually got it working.

Basically, I removed the references to the UCODE rule
from my grammars.

I set ANTLRInputStream to read the input as UTF-8:

  ANTLRInputStream input = new ANTLRInputStream(sin, "UTF-8");

and, before parsing, I rewrite any incoming file that starts
with a UTF-16 (little-endian) BOM as UTF-8:

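  // Needs java.io.* (File, DataInputStream, FileInputStream, ...).
  try {
    String path = args[args.length-1];
    File file = new File(path);
    byte[] contents = new byte[(int) file.length()];
    DataInputStream in = new DataInputStream(new FileInputStream(file));
    try {
      in.readFully(contents);  // a single read() may return fewer bytes
    } finally {
      in.close();              // close before overwriting the same file
    }

    // FF FE is the UTF-16 little-endian byte order mark
    if ( contents.length >= 2 && (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      // the "UTF-16" charset uses the BOM to pick the byte order and drops it
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF-8");
      FileOutputStream fos = new FileOutputStream(path);
      fos.write(newBytes);
      fos.close();
    }
  } catch(IOException e) {
    e.printStackTrace();
  }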


It would be wise to handle the other common encodings
here as well (this only checks for the UTF-16 little-endian
BOM), but it was enough to get me going.
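
For what it's worth, handling the other common BOMs might look
roughly like the sketch below. The class and method names
(BomSniffer, detect, decode) are made up for illustration, and
treating a file with no BOM as UTF-8 is an assumption, not
something the BOM check can tell you. It decodes in memory
instead of rewriting the file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /** Guess the charset from a leading byte order mark. */
    static Charset detect(byte[] b) {
        if (b.length >= 3 && b[0] == (byte) 0xEF
                && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE
        }
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF
        }
        return StandardCharsets.UTF_8;          // assumption: no BOM means UTF-8
    }

    /** Decode the bytes and drop a leading U+FEFF if the decoder kept it. */
    static String decode(byte[] b) {
        String s = new String(b, detect(b));
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }
}
```

The decoded String could then presumably go straight into an
ANTLRStringStream rather than back out to disk. Note that a
UTF-32 little-endian BOM (FF FE 00 00) would be mistaken for
UTF-16LE here; this sketch does not try to handle UTF-32.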

thanks again,
Ian

On 09:07 Thu 27 Aug     , Gavin Lambert wrote:
> At 06:13 27/08/2009, Ian Eyberg wrote:
> >I have text that looks like:
> >
> >  'b^@l^@a^@h^@'
> >
> >(most of the time the text is simply 'blah')
> >and then it should come out like this:
> >
> >  'blah'
> [...]
> >  UCODE   : '\u0000'{ $channel = HIDDEN; };
> >
> >I'm reading in through antlrinputstream as "UTF8" as I do
> >want to support multi-byte chars and I have rules to help
> >that such as:
> 
> I think you're going about this the wrong way.  The input above
> looks like UTF-16; you should detect that case and use a UTF16 file
> stream instead of a UTF8 one.  (Normally Unicode files will start
> with a BOM you can use for auto-detection.)
> 
> UTF-16 and UTF-8 encode high-order Unicode characters quite
> differently, so if your input can include them then trying to read
> it as UTF8 and just throwing away the nulls definitely isn't going
> to work.
> 
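
To illustrate Gavin's point, here is a small sketch of why
dropping the nulls only appears to work for plain ASCII. The
class and method names (NullStripDemo, misread) are invented
for this example; misread mimics the original approach by
decoding UTF-16LE bytes as UTF-8 and then removing the NULs.

```java
import java.nio.charset.StandardCharsets;

public class NullStripDemo {
    /** Read UTF-16LE bytes as if they were UTF-8, then strip the NULs. */
    static String misread(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        return new String(utf16, StandardCharsets.UTF_8).replace("\0", "");
    }

    public static void main(String[] args) {
        // ASCII-only text: every character is one ASCII byte plus a NUL.
        System.out.println(misread("blah").equals("blah"));
        // Non-ASCII text: the UTF-16 bytes are not valid UTF-8 sequences,
        // so the original characters cannot be recovered this way.
        System.out.println(misread("caf\u00e9").equals("caf\u00e9"));
    }
}
```

ASCII text round-trips because each UTF-16LE code unit is just
the ASCII byte followed by 00, but anything outside ASCII gets
mangled by the UTF-8 decoder before the nulls are even removed.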

