[antlr-interest] misunderstanding channel HIDDEN
Ian Eyberg
ian at telematter.com
Thu Aug 27 12:43:36 PDT 2009
Thanks everyone,

It took a bit of reading and lots of cursing, but I eventually
got it working. Basically I removed the references to the UCODE
rule in my grammars.

I set ANTLRInputStream to read "UTF-8":

    ANTLRInputStream input = new ANTLRInputStream(sin, "UTF-8");

and if an incoming file starts with a UTF-16 BOM, I rewrite it
as UTF-8 before I parse it:
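
    // Needs java.io.* (File, DataInputStream, FileInputStream,
    // FileOutputStream, IOException).
    try {
        // Read the whole file; available() is not a reliable file size
        // and a single read() call may return fewer bytes than requested.
        File file = new File(args[args.length - 1]);
        byte[] contents = new byte[(int) file.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        try {
            in.readFully(contents);
        } finally {
            in.close();   // close before reopening the same file for writing
        }
        // FF FE is the UTF-16 little-endian BOM.
        if (contents.length >= 2
                && contents[0] == (byte) 0xFF && contents[1] == (byte) 0xFE) {
            // The "UTF-16" decoder honors the BOM and drops it from the string.
            String asString = new String(contents, "UTF-16");
            byte[] newBytes = asString.getBytes("UTF-8");
            // Overwrite the original file in place with the UTF-8 bytes.
            FileOutputStream fos = new FileOutputStream(file);
            try {
                fos.write(newBytes);
            } finally {
                fos.close();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
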
It'd be wise to go ahead and handle the other common encodings
here as well (this only checks for the UTF-16 little-endian BOM),
but this got me going and does what I wanted.
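
For the "other common encodings" part, here is a rough sketch of what a
broader BOM check might look like, using only the JDK. The class and method
names (BomSniffer, detect, decode) are made up for illustration and are not
part of ANTLR. It assumes the whole file is already in memory, and it does
not handle UTF-32 (whose little-endian BOM also begins with FF FE).

```java
import java.nio.charset.Charset;

// Hypothetical helper, not an ANTLR class: picks a charset from a leading BOM.
class BomSniffer {

    /** Returns the charset implied by a leading BOM, or fallback if there is none. */
    static Charset detect(byte[] b, Charset fallback) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Charset.forName("UTF-8");      // EF BB BF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Charset.forName("UTF-16LE");   // FF FE (UTF-32LE not handled)
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Charset.forName("UTF-16BE");   // FE FF
        }
        return fallback;
    }

    /** Decodes the bytes with the detected charset and drops the BOM character. */
    static String decode(byte[] b, Charset fallback) {
        String s = new String(b, detect(b, fallback));
        // With these charsets a leading BOM decodes to U+FEFF; strip it if present.
        if (s.length() > 0 && s.charAt(0) == 0xFEFF) {
            s = s.substring(1);
        }
        return s;
    }
}
```

The name of the detected charset (for example detect(contents,
Charset.forName("UTF-8")).name()) could presumably be passed as the encoding
argument to ANTLRInputStream instead of rewriting the file on disk, though I
have not tried that here.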
thanks again,
Ian
On 09:07 Thu 27 Aug , Gavin Lambert wrote:
> At 06:13 27/08/2009, Ian Eyberg wrote:
> >I have text that looks like:
> >
> > 'b^@l^@a^@h^@'
> >
> >(most of the time the text is simply 'blah')
> >and then it should come out like this:
> >
> > 'blah'
> [...]
> > UCODE : '\u0000'{ $channel = HIDDEN; };
> >
> >I'm reading in through antlrinputstream as "UTF8" as I do
> >want to support multi-byte chars and I have rules to help
> >that such as:
>
> I think you're going about this the wrong way. The input above
> looks like UTF-16; you should detect that case and use a UTF16 file
> stream instead of a UTF8 one. (Normally Unicode files will start
> with a BOM you can use for auto-detection.)
>
> UTF-16 and UTF-8 encode high-order Unicode characters quite
> differently, so if your input can include them then trying to read
> it as UTF8 and just throwing away the nulls definitely isn't going
> to work.
>
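
To illustrate Gavin's point, here is a small self-contained sketch of why
reading UTF-16 as UTF-8 and discarding the nulls only appears to work for
plain ASCII. The class and method names are hypothetical; the method just
mimics what the UCODE-on-HIDDEN rule was effectively doing.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class: decodes UTF-16LE bytes the wrong way and the right way.
class NullStripDemo {

    /** Decodes as UTF-8 and removes NUL characters, like hiding '\u0000' tokens. */
    static String utf8ThenDropNulls(byte[] bytes) throws UnsupportedEncodingException {
        return new String(bytes, "UTF-8").replace("\0", "");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // ASCII text: every second byte is 00, so dropping nulls happens to work.
        byte[] ascii = "blah".getBytes("UTF-16LE");
        System.out.println(utf8ThenDropNulls(ascii));            // prints "blah"

        // U+4E2D is the bytes 2D 4E in UTF-16LE: no nulls at all, and both
        // bytes are valid ASCII, so UTF-8 decoding yields two wrong characters.
        byte[] cjk = "\u4e2d".getBytes("UTF-16LE");
        System.out.println(utf8ThenDropNulls(cjk));              // prints "-N"

        // Decoding with the correct charset recovers the original character.
        System.out.println(new String(cjk, "UTF-16LE").equals("\u4e2d")); // prints "true"
    }
}
```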