[antlr-interest] UCS-2/UTF-16 clarification
Wincent Colaiuta
win at wincent.com
Thu Jun 7 12:38:33 PDT 2007
Sorry to be bothering you with so many questions, Jim... I just
wanted to ask you for some quick clarification about the UCS-2/UTF-16
support in the C language target. I'm initializing my stream using
antlr3NewUCS2StringInPlaceStream (the string is already in memory,
not coming in from a file).
Is it correct that the C target is effectively "encoding agnostic"?
i.e. it doesn't really care what your input encoding is, it just
operates on 16-bit integers? In other words, if I ensure that I
really do hand it UCS-2-encoded input then it will just do the right
thing?
My actual grammars are going to be in ASCII, even though the input
text they are expected to process could conceivably be in another
encoding, and ANTLR will convert those grammars into C source files
which again are just ASCII.
I just wanted to clarify this because of the minor differences
between UCS-2 and UTF-16; they overlap for the most part, but only
UTF-16 supports surrogates, so some code points require two 16-bit
code units, whereas UCS-2 only ever uses a single 16-bit unit per
character and no surrogates (simpler, but can't encode as many code
points). But as far as the C target is concerned, it's just working
with 16-bit integers, right? So in reality, as long as I hand it
properly encoded UCS-2 I shouldn't have any problems? (I seriously
doubt I'll ever have to handle input which can't be encoded in UCS-2.)
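
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal
sketch of the kind of pre-check I have in mind before handing a buffer
to the input stream. The function names are hypothetical and are not
part of the ANTLR C runtime; the sketch only assumes that a surrogate
is any 16-bit unit in the range 0xD800 to 0xDFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ANTLR API): returns 1 if the 16-bit unit
   lies in the UTF-16 surrogate range 0xD800..0xDFFF, otherwise 0. */
int is_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

/* Hypothetical helper (not an ANTLR API): returns 1 if no unit in
   buf[0..len) is a surrogate, meaning every unit stands for exactly one
   code point and the buffer can be treated as plain UCS-2. */
int is_plain_ucs2(const uint16_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (is_surrogate(buf[i])) {
            return 0;
        }
    }
    return 1;
}
```

If a check like is_plain_ucs2 passes, each 16-bit integer in the buffer
is a complete character, which is the case I am assuming when I talk
about the target "just working with 16-bit integers".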
Cheers,
Wincent