[antlr-interest] Unicode handling
Mark Lentczner
markl at glyphic.com
Thu Apr 22 13:12:43 PDT 2004
Okay, given what I've heard here, and based on some quick experiments,
this is what I'm going to do:
I'm building an Antlr-generated C++ lexer that parses UTF-8 input.
1) This means that the C++ code has to be able to handle matching bytes
with the high bit set, such as 0xBF. So far, this looks like it won't be a big
issue - either it works or the Antlr support lib will need some small
tweaking to make sure all is 8-bit clean. (Any changes will be comin'
your way, Ric.)
2) The generated C++ code for such things looks okay, except that the
bit sets look twice as long as they should be (512 bits instead of 256),
even though all the correct bits are set.
3) I am using '\u00BF' and the like in my Antlr grammar only because I
don't like octal ('\277'). I do wish Antlr supported a third character
escape form, hex ('\xBF'), since I'm not really matching Unicode here.
Of course, none of this makes any difference in the generated code...
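To make points 1 through 3 concrete, here is a minimal C++ sketch of 8-bit-clean byte matching against a 256-bit set. It is not Antlr's generated code or its support library: ByteSet, makeRange, matches, and countContinuationBytes are hypothetical names made up for illustration, and the code assumes a C++11 compiler.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type (not from the Antlr support library):
// one bit per possible 8-bit value, so 256 bits in total.
using ByteSet = std::bitset<256>;

// Build a set holding every byte value in the inclusive range [lo, hi].
inline ByteSet makeRange(unsigned char lo, unsigned char hi) {
    assert(lo <= hi);
    ByteSet s;
    for (unsigned v = lo; v <= hi; ++v) {
        s.set(v);
    }
    return s;
}

// Membership test. The cast is what keeps this 8-bit clean: where plain
// char is signed, '\xBF' has the value -65, which is not a valid index.
inline bool matches(const ByteSet& s, char c) {
    return s.test(static_cast<unsigned char>(c));
}

// Count the UTF-8 continuation bytes (0x80..0xBF) in a byte string.
inline std::size_t countContinuationBytes(const std::string& text) {
    static const ByteSet continuation = makeRange(0x80, 0xBF);
    std::size_t n = 0;
    for (char c : text) {
        if (matches(continuation, c)) {
            ++n;
        }
    }
    return n;
}

// In C++ the octal and hex escapes name the same byte.
static_assert('\277' == '\xBF', "octal 277 and hex BF are the same char");
static_assert(static_cast<unsigned char>('\277') == 0xBF,
              "the byte value is 0xBF once converted to unsigned char");
```

For example, countContinuationBytes("\xC2\xBF") looks at the two bytes that encode U+00BF in UTF-8 and counts only the second one, 0xBF, as a continuation byte. That is the sense in which the grammar is matching bytes rather than Unicode characters.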
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/