[antlr-interest] UCS-2/UTF-16 clarification

Jim Idle jimi at temporal-wave.com
Thu Jun 7 13:00:24 PDT 2007


The default 16-bit UCS2 stream is encoding unaware, never mind agnostic
;-). It just picks up the next 16 bits from the stream when asked for
the next character, and rewinds using 16 bits rather than 8. In effect
it is UTF-16 without surrogates, unless you code your lexer to
recognize surrogates as two-character pairs.
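The behaviour can be pictured with a small sketch (illustrative only;
the struct and function names here are invented and are not the actual
ANTLR runtime code):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of an encoding-unaware 16-bit stream: lookahead just
 * returns the next 16-bit unit as-is. Surrogate pairs are NOT
 * combined unless the lexer does it for itself. */
typedef struct {
    const uint16_t *data;
    size_t          len;
    size_t          pos;
} ucs2_stream;

/* Return the k-th 16-bit code unit ahead, or -1 at EOF. */
static int32_t ucs2_la(ucs2_stream *s, size_t k)
{
    size_t i = s->pos + k - 1;
    return i < s->len ? (int32_t)s->data[i] : -1;
}

/* A lexer that wants full UTF-16 must pair the surrogates itself:
 * high surrogate (D800-DBFF) + low surrogate (DC00-DFFF). */
static int32_t utf16_next(ucs2_stream *s)
{
    int32_t c = ucs2_la(s, 1);
    if (c >= 0xD800 && c <= 0xDBFF) {          /* high surrogate */
        int32_t lo = ucs2_la(s, 2);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {    /* low surrogate  */
            s->pos += 2;
            return 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    if (c >= 0) s->pos += 1;
    return c;
}
```

So fed the pair D834 DD1E, the plain stream hands the lexer two
separate "characters", while a surrogate-aware lexer would combine
them into the single code point U+1D11E.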

The code generator for C realizes that it must generate ASCII source,
so it does not try to place any literal strings you define in your
lexer into "abc" type strings, but into arrays of UTF32 characters,
which means it works perfectly.
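Roughly what that looks like (a sketch with invented names, not the
real generated tables):

```c
#include <stdint.h>

/* Sketch: a grammar literal such as "abc" is emitted not as a C
 * char string but as a zero-terminated array of 32-bit code
 * points, so it can be compared directly against 16- or 32-bit
 * input characters with no conversion step. */
static const uint32_t lit_abc[] = { 'a', 'b', 'c', 0 };

/* Compare a UTF-32 literal against a stream of code points. */
static int match_literal(const uint32_t *lit, const int32_t *input)
{
    for (; *lit != 0; ++lit, ++input) {
        if ((int32_t)*lit != *input)
            return 0;
    }
    return 1;
}
```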

So, it is Java that processes the lexer specification, so you get UCS2
Unicode encoding, and your input stream basically supplies this as a
stream of 32 bit integers which are compared against the Unicode code
points, and it is very efficient.

If your input encoding is not just 8 bit ASCII or UCS2, then you may
need to insert a translating version of the _LA function defined for
the input stream. A quick debug session will show you exactly what to
do until I document that bit more fully. It is more likely that you
would need this for strange 8 bit encodings than for a 16 bit input,
which is more often than not UCS2 anyway. In essence then, the default
streams will cover about 90% of all cases, I think.
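The idea of a translating lookahead can be sketched like this
(illustrative only: the struct and function names are invented, and
the encoding is simplified to show just one remapped byte; a real
translating _LA would consult a full 256-entry table):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit input stream for illustration. */
typedef struct {
    const uint8_t *data;
    size_t         len;
    size_t         pos;
} byte_stream;

/* A translating lookahead: instead of handing the lexer raw
 * bytes, map each byte to its Unicode code point first. Here we
 * pretend the input is ISO-8859-15-like, where byte 0xA4 is the
 * euro sign U+20AC and (in this simplified sketch) everything
 * else matches Latin-1, i.e. byte value == code point. */
static int32_t translating_la(byte_stream *s, size_t k)
{
    size_t i = s->pos + k - 1;
    if (i >= s->len)
        return -1;                  /* EOF marker */
    uint8_t b = s->data[i];
    if (b == 0xA4)
        return 0x20AC;              /* remapped byte */
    return (int32_t)b;              /* 1:1 for the rest */
}
```

Because the lexer only ever sees code points through _LA, swapping in
a function like this is the only change the stream needs.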

Jim

> -----Original Message-----
> From: Wincent Colaiuta [mailto:win at wincent.com]
> Sent: Thursday, June 07, 2007 12:39 PM
> To: ANTLR mail-list
> Cc: Jim Idle
> Subject: UCS-2/UTF-16 clarification
> 
> Sorry to be bothering you with so many questions, Jim... I just
> wanted to ask you for some quick clarification about the UCS-2/UTF-16
> support in the C language target. I'm initializing my stream using
> antlr3NewUCS2StringInPlaceStream (the string is already in memory,
> not coming in from a file).
> 
> Is it correct that the C target is effectively "encoding agnostic"?
> ie. it doesn't really care what your input coding is, it just
> operates on 16 bit integers? In other words, if I ensure that I
> really do hand it UCS-2-encoded input then it will just do the right
> thing?
> 
> My actual grammars are going to be in ASCII, even though the input
> text they are expected to process could conceivably be in another
> encoding, and ANTLR will convert those grammars into C source files
> which again are just ASCII.
> 
> I just wanted to clarify this because of the minor differences
> between UCS-2 and UTF-16; they overlap for the most part but only
> UTF-16 supports surrogates, and some code points require two 16-bit
> units, whereas UCS-2 only ever uses single 16 bit characters and no
> surrogates (simpler but can't encode as many code points). But as far
> as the C target is concerned, it's just working with 16-bit integers,
> right? So in reality, as long as I hand it properly encoded UCS-2 I
> shouldn't have any problems? (I seriously doubt I'll ever have to
> handle input which can't be encoded in UCS-2.)
> 
> Cheers,
> Wincent
> 
