[antlr-interest] misunderstanding channel HIDDEN

Wed Aug 26 16:41:19 PDT 2009

Daniels, Troy (US SSA) wrote:
> Your BLAH rule doesn't know that it can call UCODE between characters.
> You want something like this.
> 
> startrule: blah;  /* Probably also want to include EOF here, otherwise
> the parser will successfully run against "blahblah" */
> 
> blah: B L A H;
> UCODE   : '\u0000'{ $channel = HIDDEN; };
> B: 'b';
> L: 'l';
> A: 'a';
> H: 'h';

This won't work; the "$channel = HIDDEN;" will set the channel of the
entire token to HIDDEN if it contains any '\u0000' characters (and those
characters will still be present in the token's .text field, if that
matters).

If Gavin Lambert is correct that the zero bytes occur because the input
is UTF-16, then using a UTF-16 reader is definitely the right approach:

  Reader reader = new java.io.InputStreamReader(inputstream, "UTF-16");
  ANTLRReaderStream in = new ANTLRReaderStream(reader);

(The "UTF-16" charset will auto-detect byte order if a Byte Order Mark
is present, and assumes big-endian if there is none. It won't
auto-detect between UTF-8 and UTF-16; you'll need extra code to do that.)
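
For what it's worth, here is a rough sketch of what that extra code
might look like. It only sniffs for a BOM; the class name BomSniffer and
the fallback to UTF-8 when no BOM is found are my own assumptions, not
anything ANTLR provides, and BOM-less UTF-16 input would still be
misread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: chooses a charset from the leading Byte Order Mark.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns a Reader positioned after the BOM, if one was present. */
    static Reader open(InputStream raw) throws IOException {
        // Pushback buffer of 3 bytes: the longest BOM checked here (UTF-8).
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = readUpTo(in, head);

        Charset cs = StandardCharsets.UTF_8; // assumption: no BOM means UTF-8
        int skip = 0;
        if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            skip = 3;                        // UTF-8 BOM
        } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;  // big-endian BOM
            skip = 2;
        } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;  // little-endian BOM
            skip = 2;
        }

        // Push back whatever was read beyond the BOM so the decoder sees it.
        if (n > skip) {
            in.unread(head, skip, n - skip);
        }
        return new InputStreamReader(in, cs);
    }

    private static int u(byte b) {
        return b & 0xFF; // treat the byte as unsigned
    }

    // Reads until buf is full or EOF; returns the number of bytes read.
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) break;
            total += r;
        }
        return total;
    }
}
```

The result can then be handed to ANTLR in place of the reader above:

  ANTLRReaderStream in = new ANTLRReaderStream(BomSniffer.open(inputstream));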

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com