[antlr-interest] misunderstanding channel HIDDEN

Gavin Lambert antlr at mirality.co.nz
Wed Aug 26 14:07:26 PDT 2009


At 06:13 27/08/2009, Ian Eyberg wrote:
 >I have text that looks like:
 >
 >  'b^@l^@a^@h^@'
 >
 >(most of the time the text is simply 'blah')
 >and then it should come out like this:
 >
 >  'blah'
[...]
 >  UCODE   : '\u0000'{ $channel = HIDDEN; };
 >
 >I'm reading in through antlrinputstream as "UTF8" as I do
 >want to support multi-byte chars and I have rules to help
 >that such as:

I think you're going about this the wrong way.  The input above 
looks like UTF-16 (each ASCII character followed by a null byte); 
you should detect that case and use a UTF-16 file stream instead 
of a UTF-8 one.  (Normally Unicode files will start with a BOM you 
can use for auto-detection.)
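
As a rough illustration of BOM-based auto-detection, here is a 
minimal sketch in Python (standard library only).  The helper names 
detect_encoding and decode_text are hypothetical, and falling back 
to UTF-8 when no BOM is present is an assumption; the same check 
could be done in whatever language your ANTLR target uses before 
you pick the encoding for the input stream.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from a leading BOM; fall back to `default` (assumption)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def decode_text(data: bytes) -> str:
    """Decode bytes using the BOM-detected encoding; the codec consumes the BOM."""
    return data.decode(detect_encoding(data))
```

A file without a BOM falls through to the default here, so input 
that is UTF-16 but carries no BOM would still need some other hint.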

UTF-16 and UTF-8 encode non-ASCII Unicode characters quite 
differently, so if your input can include them then trying to read 
it as UTF-8 and just throwing away the nulls definitely isn't going 
to work.
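
To make the failure concrete, here is a small Python sketch of the 
"read as UTF-8 and drop the nulls" approach.  The function name 
strip_nulls_then_utf8 is hypothetical and the sample strings are 
made up for illustration; it only mimics what hiding '\u0000' 
tokens amounts to at the byte level.

```python
def strip_nulls_then_utf8(data: bytes) -> str:
    """Mimic the workaround: delete every null byte, then decode as UTF-8."""
    return data.replace(b"\x00", b"").decode("utf-8", errors="replace")


# Pure ASCII happens to survive, because UTF-16 just pads each byte with a null.
ascii_utf16 = "blah".encode("utf-16-le")
ascii_result = strip_nulls_then_utf8(ascii_utf16)

# The euro sign (U+20AC) is AC 20 in UTF-16 LE but E2 82 AC in UTF-8,
# so the leftover bytes are not a valid UTF-8 sequence.
euro_text = "blah\u20ac"
euro_result = strip_nulls_then_utf8(euro_text.encode("utf-16-le"))

# U+0100 is 00 01 in UTF-16 LE; its low byte is itself a null and gets
# thrown away, silently turning the character into something else.
macron_text = "\u0100"
macron_result = strip_nulls_then_utf8(macron_text.encode("utf-16-le"))
```

Only the ASCII case round-trips; the other two come back as 
different text, which is why decoding the input as UTF-16 in the 
first place is the safer route.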


