[antlr-interest] Re: proposal for 2.7.4: charVocabulary defaults
to ascii 1..127
brian-l-smith at uiowa.edu
Sun May 2 15:00:40 PDT 2004
Oliver Zeigermann wrote:
>Mike Lischke wrote:
>>>Now you seem to mix something up. Both UTF-16 and UTF-32 are
>>>character encodings as well, just as UTF-8. All of them are
>>>converted to characters before parsing.
>>Sure, but how is the internal representation? Actually, it is UTF-16. So although it is a transformation format it is
>>also the actual character representation. Hence UTF-16 (as well as UTF-32) can be processed directly. UTF-8 has to be
>>converted first to one of these formats (usually, at least). This is what I meant.
>What the internal representation is, you simply do not know and there is
>also no need to know. Certainly, it is not UTF-16 as it only allows for
>64K characters which is far too little.
In ANTLR for Java, you do know the representation, and for some
applications it is important. It is a 16-bit integer described by the
'char' type. For JRE 1.2-1.4, 'char' is a 16-bit Unicode code point.
(Unicode 1.x - 3.x depending on the JRE version). In JRE 1.5, 'char' is
redefined to be a 16-bit Unicode 4.0 code unit, that may represent
either a whole character (code point), or a partial character that needs
to be combined with an adjacent one according to the UTF-16
transformation rules. See http://weblogs.java.net/pub/wlg/1202 and the
documents it references.
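
To make the code unit vs. code point distinction concrete, here is a
minimal sketch using the supplementary-character methods added to
java.lang.Character and java.lang.String in JRE 1.5. The class name
CodeUnitDemo and the sample character U+1D50A are only illustrative
choices, not anything taken from ANTLR.

```java
class CodeUnitDemo {
    // Number of 16-bit 'char' code units in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (whole characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D50A lies outside the 16-bit range, so it needs a surrogate pair.
        String g = new String(Character.toChars(0x1D50A));

        System.out.println(codeUnits(g));   // 2 code units
        System.out.println(codePoints(g));  // 1 code point

        // Each 'char' on its own is only half of the character.
        System.out.println(Character.isHighSurrogate(g.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(g.charAt(1)));  // true

        // Combining the two halves recovers the original code point.
        int cp = Character.toCodePoint(g.charAt(0), g.charAt(1));
        System.out.println(Integer.toHexString(cp)); // 1d50a
    }
}
```

A lexer that looks at one 'char' at a time sees the two surrogates as
separate input symbols, which is the problem described above.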
IMO, in order to fully support Unicode 4.0, ANTLR (for Java) would need
to replace all usages of 'char' with 'java.lang.String' or 'int.'
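
As a rough illustration of the 'int' option, the sketch below walks a
String one code point at a time and hands each one back as an 'int'.
CodePointCursor and its next() method are hypothetical names made up
for this example; this is not how ANTLR's generated lexers or its
input classes are actually written.

```java
class CodePointCursor {
    private final String text;
    private int index = 0; // position in 'char' code units

    public CodePointCursor(String text) {
        this.text = text;
    }

    // Returns the next code point as an int, or -1 at end of input.
    public int next() {
        if (index >= text.length()) {
            return -1;
        }
        // codePointAt combines a surrogate pair into one value if present.
        int cp = text.codePointAt(index);
        // Advance by 1 or 2 code units depending on the code point.
        index += Character.charCount(cp);
        return cp;
    }

    public static void main(String[] args) {
        String input = "a" + new String(Character.toChars(0x1D50A)) + "b";
        CodePointCursor cursor = new CodePointCursor(input);
        for (int cp = cursor.next(); cp != -1; cp = cursor.next()) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

With 'int' as the symbol type, a supplementary character is a single
input symbol, so range checks such as a charVocabulary test would work
on whole characters rather than on surrogate halves.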