[antlr-interest] Unicode Category Question
Johannes Luber
jaluber at gmx.de
Mon Mar 24 12:27:57 PDT 2008
Darryl A. J. Staflund schrieb:
> Hi everyone,
>
> I am a new ANTLR user and have started to write an ECMA-compliant lexer
> for C# using ANTLR 3.0.1. I know that other C# lexers exist on the site
> but I want to try writing one for myself to get a feel for ANTLR and to
> learn how to use it to deal with pre-processing directives, etc. Since
> I hope to compile the lexer using C# instead of Java, I have set my
> target language as 'CSharp' in the options of my grammar file. I then
> generate the source code using the Java-based 'org.antlr.Tool' class. I
> am using the Sun Java SDK 1.6.0_05.
>
> I have run into two difficulties with this current approach:
>
> 1. The ECMA 334 specification defines a C# 2.x Unicode escape sequence
> as follows:
>
> unicode-escape-sequence::
>     \u hex-digit hex-digit hex-digit hex-digit
>     \U hex-digit hex-digit hex-digit hex-digit
>        hex-digit hex-digit hex-digit hex-digit
>
> Although the current (?) Java specification handles the first option
> just fine, it handles the second option a bit differently as stated in
> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1:
>
> "The Unicode standard was originally designed as a fixed-width 16-bit
> character encoding. It
> has since been changed to allow for characters whose representation
> requires more than 16 bits.
> The range of legal code points is now U+0000 to U+10FFFF, using the
> hexadecimal U+n notation.
> Characters whose code points are greater than U+FFFF are called
> supplementary characters. To
> represent the complete range of characters using only 16-bit units,
> the Unicode standard defines an
> encoding called UTF-16. In this encoding, supplementary characters
> are represented as pairs of
> 16-bit code units, the first from the high-surrogates range, (U+D800
> to U+DBFF), the second from
> the low-surrogates range (U+DC00 to U+DFFF). For characters in the
> range U+0000 to U+FFFF,
> the values of code points and UTF-16 code units are the same."
>
> Since Java represents supplementary characters as pairs of 16-bit code
> units instead of as single '\Uxxxxxxxx' string sequences, I don't know
> how to parse these latter values in my ANTLR grammar. Does ANTLR's
> Java-based lexer handle Unicode supplementary characters passed to it in
> the '\Uxxxxxxxx' format? How should I handle something like this?
For these questions, I refer you to my dissertation, which I will send
to you (and anyone else) off-list.
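For reference, java.lang.Character exposes the surrogate pairing the JLS quote describes, so a \Uxxxxxxxx escape can be decoded to a code point and then expanded to the two UTF-16 units a Java-based lexer actually sees. A minimal sketch (U+10400 is an arbitrary supplementary character chosen for illustration):

```java
// Sketch: how Java maps a supplementary code point (as written in a
// C# \Uxxxxxxxx escape) onto a UTF-16 surrogate pair. A Java lexer
// sees the two char units, not a single 32-bit value.
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10400;            // DESERET CAPITAL LETTER LONG I
        char[] units = Character.toChars(codePoint);
        // Supplementary characters become a high/low surrogate pair.
        System.out.printf("high=U+%04X low=U+%04X%n",
                          (int) units[0], (int) units[1]);
        // The pair round-trips back to the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint);
    }
}
```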
> 2. The ECMA 334 specification defines identifiers in terms of Unicode
> character categories as follows:
>
> letter-character::
> A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
> A unicode-escape-sequence representing a character of classes Lu, Ll,
> Lt, Lm, Lo, or Nl
>
> combining-character::
> A Unicode character of classes Mn or Mc
> A unicode-escape-sequence representing a character of classes Mn or Mc
>
> decimal-digit-character::
> A Unicode character of the class Nd
> A unicode-escape-sequence representing a character of the class Nd
>
> connecting-character::
> A Unicode character of the class Pc
> A unicode-escape-sequence representing a character of the class Pc
>
> formatting-character::
> A Unicode character of the class Cf
> A unicode-escape-sequence representing a character of the class Cf
>
> I have read various posts in the ANTLR newsgroups on how to recognize
> these categories and see that a number of strategies have been discussed:
>
> - Define the categories in terms of Unicode escape sequences, e.g.
> "WHITESPACE: ('\u0020' | ('\u2000'..'\u200A') | ...etc...);".
> - Define the categories in terms of semantic predicates, e.g.
> "WHITESPACE: { IsUnicodeCategoryZs (LA (1)) }? ;".
>
> The difficulty with the first approach in regard to the C# specification
> is that some Unicode character categories (e.g. Nd) contain
> supplementary characters. This means that in order to represent them in
> a Java-friendly manner, I need to convert them into pairs of Unicode
> characters. Ugghh! I don't mind the second option, although it sounds
> as though it will run slower.
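On the second strategy: the Java side of such a predicate can be backed by java.lang.Character, which reports general categories directly. A minimal sketch (isUnicodeCategoryZs mirrors the helper name in the post; isLetterCharacter is an added illustration of the ECMA-334 letter-character classes):

```java
// Sketch of helpers a semantic predicate like
// { IsUnicodeCategoryZs(LA(1)) }? might call on the Java side.
// Character.getType reports the Unicode general category of a code point.
public class CategoryPredicate {
    static boolean isUnicodeCategoryZs(int c) {
        return Character.getType(c) == Character.SPACE_SEPARATOR; // Zs
    }
    static boolean isLetterCharacter(int c) {
        // ECMA-334 letter-character: classes Lu, Ll, Lt, Lm, Lo, or Nl
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:   // Lu
            case Character.LOWERCASE_LETTER:   // Ll
            case Character.TITLECASE_LETTER:   // Lt
            case Character.MODIFIER_LETTER:    // Lm
            case Character.OTHER_LETTER:       // Lo
            case Character.LETTER_NUMBER:      // Nl
                return true;
            default:
                return false;
        }
    }
    public static void main(String[] args) {
        System.out.println(isUnicodeCategoryZs(0x2000)); // EN QUAD (Zs) -> true
        System.out.println(isLetterCharacter('A'));      // Lu -> true
        System.out.println(isLetterCharacter('1'));      // Nd -> false
    }
}
```

Because getType takes an int code point, the same check works for supplementary characters once the surrogate pair has been combined.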
>
> Since the newer regular expression engines used by Perl, Java, C#, etc.
> have been built to match on Unicode, could ANTLR's EBNF be extended to
> match on the following:
>
> - Unicode Character Properties (i.e.: \p{Lu}, \p{Mn}, etc...)
> - Unicode Scripts (i.e. \p{Common}, \p{Arabic}, etc...)
> - Unicode Blocks (i.e. \p{Currency_Symbol}, \p{InBasic_Latin}, ...)
>
>
> If this were done, we could use them in the ANTLR parser like so:
>
> letter-character
>     : \p{Lu}
>     | \p{Ll}
>     | \p{Lt}
>     | \p{Lm}
>     | \p{Lo}
>     | \p{Nl}
>     ;
>
> etc...
>
> We could also use them to do the following:
>
> - Define characters in terms of negations (e.g. "TOKEN: \p{^Lu} |
> \p{^Letter};")
> - Match letters including diacritics (e.g. "LETTER_AND_DIACRITIC: \p{L}
> \p{M}*;")
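For comparison, the \p{...} flavors proposed here already exist in java.util.regex. A minimal sketch of category, block, script, and negation matching (script properties require Java 7 or later):

```java
// The \p{...} property flavors from the proposal, as java.util.regex
// already supports them: general category, block (In prefix), script
// (Is prefix, Java 7+), negation, and letter-plus-combining-marks.
public class UnicodePropertyDemo {
    public static void main(String[] args) {
        // General category Lu (any Unicode uppercase letter, not just A-Z):
        System.out.println("\u00C9".matches("\\p{Lu}"));        // E-acute -> true
        // Block, with the In prefix:
        System.out.println("a".matches("\\p{InBasic_Latin}"));  // true
        // Script, with the Is prefix (Java 7 and later):
        System.out.println("\u0628".matches("\\p{IsArabic}"));  // true
        // Negation, as in the \p{^Lu} example (Java spells it \P{Lu}):
        System.out.println("a".matches("\\P{Lu}"));             // true
        // Letter plus combining marks, as in LETTER_AND_DIACRITIC:
        System.out.println("e\u0301".matches("\\p{L}\\p{M}*")); // true
    }
}
```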
An interesting enhancement, but unnecessary for the problem, as
described in my dissertation.
Johannes