[antlr-interest] Unicode lexing

Jonathan S. Shapiro shap at eros-os.org
Tue Mar 9 15:58:50 PST 2010


Well *that* was weird. Sorry for the mis-send.

I know this topic has come up before, and sorry to bring it up again.

Context: I'm bringing up BitC on CLI, and planning to use ANTLR to do it.
BitC characters cover the full Unicode (21-bit) range, up to U+10FFFF.

The good news:

   1. Characters above U+FFFF can only appear in character and string
   literals.
   2. The language requires that units of compilation be encoded in UTF-8.
   3. Both JVM and CLI carry strings as UTF-16, so if we carry character
   literals around as string payloads we can deal with this internally.
   4. Outside of character and string literals the legal input characters
   all fall within the 16-bit Unicode subset (the Basic Multilingual Plane).
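To make point 3 concrete: a character literal above U+FFFF can ride along as a
two-char string payload because UTF-16 splits it into a surrogate pair. Here is
a minimal Java sketch; the `toUtf16` helper is hypothetical (on the JVM,
`Character.toChars` already does exactly this):

```java
public class SurrogateDemo {
    // Encode a code point as UTF-16 code units, the form JVM/CLI strings use.
    // Code points above U+FFFF become a high/low surrogate pair.
    static char[] toUtf16(int codePoint) {
        if (codePoint <= 0xFFFF) {
            return new char[] { (char) codePoint };   // fits in one code unit
        }
        int v = codePoint - 0x10000;                  // 20 bits remain
        char high = (char) (0xD800 + (v >>> 10));     // high surrogate: top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));    // low surrogate: bottom 10 bits
        return new char[] { high, low };
    }
}
```

For example, U+1D11E (musical G clef) comes back as the pair D834 DD1E, which
is what a JVM or CLI string stores internally for that character.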

When we dealt with this in the current, yacc-based implementation, we
proceeded as follows:

   1. We hand-wrote the lexer and had it process the raw input as a byte
   stream. We then hand-decoded UTF-8 sequences as appropriate.
   2. To carry around string literal values we encoded them internally as
   UTF-8 (because this was C). In JVM/CLR, obviously, we would encode in
   UTF-16.
   3. We internally carted character literal values around as an unsigned
   32-bit integer.
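The hand decoding in step 1 amounts to reading the UTF-8 length from the lead
byte and accumulating the continuation bytes. A minimal Java sketch of that
idea, assuming well-formed input (no overlong, truncated, or surrogate-range
checks, which a production lexer would need):

```java
public class Utf8Decode {
    // Decode the UTF-8 sequence starting at 'pos' in 'bytes' and return the
    // code point as an int (the 32-bit carrier mentioned above).
    static int decode(byte[] bytes, int pos) {
        int b0 = bytes[pos] & 0xFF;
        if (b0 < 0x80)                                 // 1 byte: plain ASCII
            return b0;
        if (b0 < 0xE0)                                 // 2 bytes: U+0080..U+07FF
            return ((b0 & 0x1F) << 6)
                 |  (bytes[pos + 1] & 0x3F);
        if (b0 < 0xF0)                                 // 3 bytes: U+0800..U+FFFF
            return ((b0 & 0x0F) << 12)
                 | ((bytes[pos + 1] & 0x3F) << 6)
                 |  (bytes[pos + 2] & 0x3F);
        return ((b0 & 0x07) << 18)                     // 4 bytes: U+10000..U+10FFFF
             | ((bytes[pos + 1] & 0x3F) << 12)
             | ((bytes[pos + 2] & 0x3F) << 6)
             |  (bytes[pos + 3] & 0x3F);
    }
}
```

Decoding to an int this way sidesteps surrogates entirely inside the lexer;
conversion to UTF-16 happens only when the value is stored into a string.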

So basically, we found that an "arm's-length Unicode" approach worked out
for us. I had thought to adopt a similar approach with ANTLR.

I've been reading the ANTLR Reference book, and I noted a comment to the
effect that if you hand-write a lexer you lose the ability to do certain
kinds of lookahead. Is this the case, or is it possible to hand-write a
lexer in a fashion that cooperates with the regular behavior of ANTLR?

Thanks


Jonathan
