[antlr-interest] misunderstanding channel HIDDEN

Wed Aug 26 16:29:04 PDT 2009

Ian Eyberg wrote:
> Hi,
>   I think I'm misunderstanding the usage of $channel = HIDDEN
> or skip().
> 
> I have text that looks like:
> 
>   'b^@l^@a^@h^@'
> 
> (most of the time the text is simply 'blah')
> and then it should come out like this:
> 
>   'blah'
> 
> my relevant rules are:
> 
>   startrule : BLAH;
>   BLAH    : 'blah';
>   UCODE   : '\u0000'{ $channel = HIDDEN; };
> 
> I'm reading in through antlrinputstream as "UTF8" as I do
> want to support multi-byte chars and I have rules to help
> that such as:
> 
> UNICODE : ('\u00a0'..'\uffff');
> 
> What am I doing wrong here?

Using a hidden channel won't work if you want 'blah' to be a single
token. The '$channel = HIDDEN;' in the action for UCODE sets the channel
for that token, but does not otherwise affect lexing, so you will end up
with a token stream like:

  'b' <hidden '^@'> 'l' <hidden '^@'> 'a' <hidden '^@'> 'h' <hidden '^@'>

It is possible to ignore characters within a token, but it requires more
work. If you only have to ignore NULs within an identifier, say, then
consider something like:

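  Identifier
    @init { StringBuilder sb = new StringBuilder(); }
    : ( c='a'..'z' { sb.append((char) c); }  // keep letters
      | '\u0000'                             // match NULs but do not keep them
      )+                                     // '+' so the rule cannot match empty input
        { setText(sb.toString()); }          // token text with the NULs removed
    ;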

If you have to ignore certain characters or bytes anywhere, then I suggest
using a filtering InputStream (in the case of UTF-8; for UTF-16 it would
be a Reader) that strips them out before they get to the lexer.
Providing your own subclass of one of the ANTLR stream classes could
also be made to work, but is probably no simpler in this situation.
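
As a rough illustration of the filtering-InputStream idea, here is a minimal
sketch. The class name NulStrippingInputStream and its strip() demo helper are
made up for this example and are not part of ANTLR. It assumes the input really
is UTF-8, where a 0x00 byte can only ever encode U+0000, so dropping those bytes
cannot damage a multi-byte character. It does not try to handle skip() or
mark()/reset().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: drops every NUL (0x00) byte from the wrapped stream. */
class NulStrippingInputStream extends FilterInputStream {

    NulStrippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        // Skip NUL bytes; stop at the first other byte or at end of stream (-1).
        do {
            b = super.read();
        } while (b == 0);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // -1 at end of stream
            }
            // Compact the bytes just read, dropping NULs in place.
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (buf[i] != 0) {
                    buf[kept++] = buf[i];
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // Everything read was NUL: read again rather than return 0.
        }
    }

    /** Demo helper (an assumption for this example): filter a byte array and decode it as UTF-8. */
    static String strip(byte[] raw) throws IOException {
        try (InputStream in = new NulStrippingInputStream(new ByteArrayInputStream(raw))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With something like this in place, the lexer input could be created as
new ANTLRInputStream(new NulStrippingInputStream(rawStream), "UTF-8"), where
rawStream stands for whatever InputStream you are reading from now. The BLAH
rule would then see plain 'blah' and UCODE would no longer be needed.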

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com