[antlr-interest] UCS-2/UTF-16 clarification

Thu Jun 7 12:38:33 PDT 2007

Sorry to be bothering you with so many questions, Jim... I just  
wanted to ask you for some quick clarification about the UCS-2/UTF-16  
support in the C language target. I'm initializing my stream using  
antlr3NewUCS2StringInPlaceStream (the string is already in memory,  
not coming in from a file).

Is it correct that the C target is effectively "encoding agnostic"?
i.e., it doesn't really care what your input encoding is, it just
operates on 16-bit integers? In other words, if I ensure that I
really do hand it UCS-2-encoded input then it will just do the right
thing?

My actual grammars are going to be in ASCII, even though the input  
text they are expected to process could conceivably be in another  
encoding, and ANTLR will convert those grammars into C source files  
which again are just ASCII.

I just wanted to clarify this because of the minor differences
between UCS-2 and UTF-16; they overlap for the most part, but only
UTF-16 supports surrogates, so some code points require two 16-bit
code units, whereas UCS-2 only ever uses a single 16-bit code unit
per character and has no surrogates (simpler, but can't encode as many
code points). But as far as the C target is concerned, it's just
working with 16-bit integers, right? So in reality, as long as I hand
it properly encoded UCS-2 I shouldn't have any problems? (I seriously
doubt I'll ever have to handle input which can't be encoded in UCS-2.)
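
To make "properly encoded UCS-2" concrete, here is a minimal sketch of the
kind of pre-check I have in mind before handing the buffer to the stream.
The function name is_surrogate_free is my own invention, not part of the
ANTLR C runtime, and I'm assuming the buffer is already in native byte
order. It only confirms that no 16-bit unit falls in the UTF-16 surrogate
range 0xD800..0xDFFF, so each unit stands for exactly one code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ANTLR API): returns 1 if no 16-bit unit in
 * buf lies in the UTF-16 surrogate range 0xD800..0xDFFF, i.e. the buffer
 * can be treated as plain UCS-2 with one unit per code point.
 * Returns 0 as soon as a surrogate code unit is found. */
static int is_surrogate_free(const uint16_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0xD800 && buf[i] <= 0xDFFF) {
            return 0;
        }
    }
    return 1;
}
```

If this returns 1, then UCS-2 and UTF-16 are the same thing for that
input, and the question above reduces to whether the target really does
just compare 16-bit integers.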

Cheers,
Wincent