[antlr-interest] Unicode lexing
Jonathan S. Shapiro
shap at eros-os.org
Tue Mar 9 15:58:50 PST 2010
Well *that* was weird. Sorry for the mis-send.
I know this topic has come up before, and sorry to bring it up again.
Context: I'm bringing up BitC on CLI, and planning to use antlr to do it.
BitC characters cover the full Unicode range (U+0000 through U+10FFFF, 21 bits).
The good news:
1. Characters above U+FFFF can only appear in character and string
literals.
2. The language requires that units of compilation be encoded in UTF-8.
3. Both JVM and CLI carry strings as UTF-16, so if we carry character
literals around as string payloads we can deal with this internally.
4. Outside of character and string literals, the legal input characters
all fall within the 16-bit Unicode subset (the Basic Multilingual Plane).
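Point (3) above can be sketched concretely. This is a minimal illustration (not from the original post) of how a character literal above U+FFFF becomes a two-code-unit UTF-16 string payload on the JVM; the code point U+1D11E is just an example value:

```java
public class SurrogatePayload {
    // Convert a character-literal code point into its UTF-16 string payload.
    static String payload(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = payload(0x1D11E);          // example supplementary-plane literal
        System.out.println(s.length());       // 2: stored as a surrogate pair
        System.out.println(s.codePointAt(0)); // 119070 (0x1D11E) round-trips intact
    }
}
```

Because `String.codePointAt` reassembles the surrogate pair, the original code point survives the round trip even though the payload occupies two 16-bit units.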
When we dealt with this in the current, yacc-based implementation, we
proceeded as follows:
1. We hand-wrote the lexer and had it process the raw input as a byte
stream. We then hand-decoded UTF-8 sequences as appropriate.
2. To carry around string literal values we encoded them internally as
UTF-8 (because this was C). In JVM/CLR, obviously, we would encode in
UTF-16.
3. We internally carted character literal values around as unsigned
32-bit integers.
So basically, we found that an "arm's length unicode" approach worked out
for us. I had thought to adopt a similar approach with Antlr.
I've been reading the Antlr Reference book, and I noted a comment to the
effect that if you hand-write a lexer you lose the ability to do certain
kinds of lookahead. Is this the case, or is it possible to hand-write a
lexer in a fashion that cooperates with the regular behavior of Antlr?
Thanks
Jonathan