[antlr-interest] Unicode handling

Thu Apr 22 13:12:43 PDT 2004

Okay, given what I've heard here, and based on some quick experiments, 
this is what I'm going to do:

I'm generating a C++ lexer with Antlr that parses UTF-8 input.

1) This means that the C++ code has to be able to match bytes with the
high bit set, such as 0xBF.  So far, this looks like it won't be a big
issue: either it already works, or the Antlr support lib will need some
small tweaking to make sure everything is 8-bit clean.  (Any changes
will be comin' your way, Ric.)

2) The C++ code generated for such matches looks okay, except that the
bit sets look twice as long as they should be (512 bits rather than
256), even though all the correct bits are set.

3) I am using '\u00BF' and the like in my Antlr grammar only because I
don't like octal ('\277').  I do wish Antlr supported a third character
escape, hex '\xBF', since I'm not really matching Unicode here, just
bytes.  Of course, none of this makes any difference in the generated
code...
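
For what it's worth, here is a minimal sketch of what "8-bit clean" in
point 1 amounts to on the C++ side.  It is not taken from the Antlr
support lib; the helper names byteValue and isContinuationByte are made
up for illustration.  The usual pitfall is that plain char may be
signed, so a byte like 0xBF can compare as a negative number unless it
is widened through unsigned char first.

```cpp
#include <cassert>

// Hypothetical helper (not part of the Antlr support library):
// widen a byte from a char buffer to an int in the range 0..255,
// regardless of whether plain char is signed on this platform.
inline int byteValue(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical helper: true for UTF-8 continuation bytes,
// i.e. bytes of the form 10xxxxxx (0x80 through 0xBF).
inline bool isContinuationByte(char c) {
    return (byteValue(c) & 0xC0) == 0x80;
}

// The octal and hex escapes name the same byte in C++, so the choice
// of escape in the grammar should not matter to the generated code.
inline void checkEscapes() {
    assert('\277' == '\xBF');
    assert(byteValue('\277') == 0xBF);
}
```

Note that comparing a plain char directly against 0xBF can fail where
char is signed, which is why the sketch goes through byteValue first.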

- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
