[antlr-interest] Unicode & C++ & ANTLR2 (and a bit 3)

Pete Forman pete.forman at westerngeco.com
Tue Jul 6 08:06:10 PDT 2004


At 2004-07-06 15:56 +0200, Ric Klaren wrote:
>ANTLR2 needs to support 32 bit escapes in the lexer to support full
>unicode. (currently can't specify values above \uFFFF; could opt to
>introduce a new escape syntax that supports variable length hex values
>\u{(HEXDIGIT)+} or something)
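
To make the proposed escape concrete, here is a minimal Python sketch of
expanding a variable-length hex escape.  The \u{...} syntax is only the
suggestion quoted above, not existing ANTLR syntax, and the name
expand_escapes is hypothetical.

```python
import re

# Hypothetical escape syntax from the proposal: \u{H+}, one or more hex digits.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def expand_escapes(text):
    # Replace every \u{H+} escape in text with the code point it names.
    def _sub(match):
        value = int(match.group(1), 16)
        if value > MAX_CODE_POINT:
            raise ValueError("escape beyond Unicode range: " + match.group(0))
        if 0xD800 <= value <= 0xDFFF:
            # Surrogates are UTF-16 code units, not characters in their own right.
            raise ValueError("escape names a surrogate: " + match.group(0))
        return chr(value)

    return _ESCAPE.sub(_sub, text)
```

With this sketch, the input a\u{41}b expands to "aAb", and \u{1D11E}
yields a single code point above the BMP.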

32 bits would be enough to support all Unicode code points but is that
adequate?  You may want to work with grapheme clusters which consist
of one or more code points.  Loosely, a grapheme cluster is what the
end user might call a character.  'e' and combining acute are two code
points but one grapheme cluster.  That particular combination can be
normalized to a single code point but many cannot.
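
A small Python sketch of the normalization point, using the standard
unicodedata module.  The helper name nfc_length and the choice of 'q'
plus combining acute as a pair with no precomposed form are mine, for
illustration only.

```python
import unicodedata


def nfc_length(text):
    # Number of code points left after NFC (canonical composition).
    return len(unicodedata.normalize("NFC", text))


e_acute = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: composes to U+00E9
q_acute = "q\u0301"  # 'q' + COMBINING ACUTE ACCENT: no precomposed form exists

composed = unicodedata.normalize("NFC", e_acute)
```

Here nfc_length(e_acute) is 1 while nfc_length(q_acute) stays at 2,
although an end user would call each of them one character.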

Issues with code points above the BMP are similar to those with
combining characters.

I've been looking at how Java (JDK 1.5), C# and ICU deal with
encodings.  They all use strings of 16 bit characters (UTF-16) as the
prime units to operate on.  Single characters are available in 32
bits (UTF-32).  Grapheme clusters are available as UTF-16 strings;
the length of those is generally one, but two for surrogate pairs
and two or more where combining characters are involved.
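
The lengths mentioned above can be checked with a short Python sketch.
Python strings are sequences of code points rather than UTF-16, so the
sketch counts 16 bit units by encoding; the helper names utf16_units and
surrogate_pair are hypothetical.

```python
def utf16_units(text):
    # Count the 16 bit code units in the UTF-16 encoding (no byte order mark).
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    # Split a code point above the BMP into its (high, low) surrogates.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_char = "A"              # one code point, one UTF-16 unit
clef = "\U0001D11E"         # MUSICAL SYMBOL G CLEF, above the BMP
e_acute = "e\u0301"         # base letter plus combining acute
```

utf16_units gives 1 for bmp_char and 2 for both clef and e_acute, and
surrogate_pair(0x1D11E) is (0xD834, 0xDD1E).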

A question to ask is what is best for ANTLR.  It is not a word
processor or text renderer.  The lexer is converting characters into
tokens.  These tokens tend to be either punctuation, whitespace,
comments, keywords, or literals.  The last three are fairly opaque.
We need to be able to do some sort of isKeyword test on the last two
but aside from that their contents should be of little interest to
the lexer.
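
As a rough sketch of that idea in Python: the keyword test only compares
whole token texts, so it never needs to look inside them.  The keyword
set and the names is_keyword and token_type are hypothetical, not
anything ANTLR generates.

```python
KEYWORDS = frozenset({"if", "else", "while"})  # hypothetical keyword set


def is_keyword(text):
    # Whole-string comparison; the contents of the token stay opaque.
    return text in KEYWORDS


def token_type(text):
    # Classify a word-like token as a keyword or a plain identifier.
    return "KEYWORD" if is_keyword(text) else "IDENT"
```

A token containing a character above the BMP or a combining sequence is
classified the same way as any other text, without special handling.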

My preference would be to stick with UTF-16 strings in ANTLR for
compatibility with current languages and libraries.  Leave it up to
the grammar writers to deal with >16 bit issues where they arise.

-- 
Pete Forman                -./\.-  Disclaimer: This post is originated
WesternGeco                  -./\.-   by myself and does not represent
pete.forman at westerngeco.com    -./\.-   opinion of Schlumberger, Baker
http://petef.port5.com           -./\.-   Hughes or their divisions.



 