[antlr-interest] Unicode Category Question
Darryl A. J. Staflund
darryl.aj.staflund at shaw.ca
Sat Mar 22 13:52:39 PDT 2008
Hi everyone,
I am a new ANTLR user and have started to write an EMCA-compliant lexer
for C# using ANTLR 3.0.1. I know that other C# lexers exist on the site
but I want to try writing one for myself to get a feel for ANTLR and to
learn how to use it to deal with pre-processing directives, etc. Since
I hope to compile the lexer using C# instead of Java, I have set my
target language as 'CSharp' in the options of my grammar file. I then
generate the source code using the Java-based 'org.antlr.Tool' class. I
am using the Sun Java SDK 1.6.0_05.
I have run into two difficulties with this current approach:
1. The ECMA 334 specification defines a C# 2.x Unicode escape sequence
as follows:
unicode-escape-sequence::
\u hex-digit hex-digit hex-digit hex-digit
\U hex-digit hex-digit hex-digit hex-digit hex-digit
hex-digit hex-digit hex-digit
Although the current (?) Java specification handles the first option
just fine, it handles the second option a bit differently as stated in
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1:
"The Unicode standard was originally designed as a fixed-width
16-bit character encoding. It
has since been changed to allow for characters whose representation
requires more than 16 bits.
The range of legal code points is now U+0000 to U+10FFFF, using the
hexadecimal U+n notation.
Characters whose code points are greater than U+FFFF are called
supplementary characters. To
represent the complete range of characters using only 16-bit units,
the Unicode standard defines an
encoding called UTF-16. In this encoding, supplementary characters
are represented as pairs of
16-bit code units, the first from the high-surrogates range, (U+D800
to U+DBFF), the second from
the low-surrogates range (U+DC00 to U+DFFF). For characters in the
range U+0000 to U+FFFF,
the values of code points and UTF-16 code units are the same."
Since Java represents supplementary characters as pairs of 16-byte code
units instead of as single '\Uxxxxxxxx' string sequences, I don't know
how to parse these latter values in my ANTLR grammar. Does ANTLR's
Java-based lexer handle Unicode supplementary characters passed to it in
the '\Uxxxxxxxx' format? How should I handle something like this?
2. The ECMA 334 specifications defines identifiers in terms of Unicode
character categories as follows:
letter-character::
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
A unicode-escape-sequence representing a character of classes Lu,
Ll, Lt, Lm, Lo, or Nl
combining-character::
A Unicode character of classes Mn or Mc
A unicode-escape-sequence representing a character of classes Mn or Mc
decimal-digit-character::
A Unicode character of the class Nd
A unicode-escape-sequence representing a character of the class Nd
connecting-character::
A Unicode character of the class Pc
A unicode-escape-sequence representing a character of the class Pc
formatting-character::
A Unicode character of the class Cf
A unicode-escape-sequence representing a character of the class Cf
I have read various posts in the ANTLR newsgroups on how to recognize
these categories and see that a number of strategies have been discussed:
- Define the categories in terms of Unicode escape sequences, eg.
"WHITESPACE: ('\u0020' | ('\u2000'.'\u0200A') | ...etc...);".
- Define the categories in terms of semantic predicates, eg.
"WHITESPACE: { IsUnicodeCategoryZs (LA (1)) }? ;".
The difficulty with the first approach in regard to the C# specification
is that some Unicode character categories (ex: Nd) contain
supplementary characters. This means that in order to represent them in
a Java-friendly manner, I need to convert them into pairs of Unicode
characters. Ugghh! I don't mind the second option, although it sounds
as though it will run slower.
Since the newer regular expression engines used by Perl, Java, C#, etc.
have been built to match on Unicode, could ANTLR's EBNF be extended to
match on the following:
- Unicode Character Properties (i.e.: \p{Lu}, \p{Mn}, etc...)
- Unicode Scripts (i.e. \p{Common}, \p{Arabic}, etc...)
- Unicode Blocks (i.e. \p{Currency_Symbol}, \p{InBasic_Latin}, ...)
If this were done, we could use them in the ANTLR parser as so:
letter-character:
: \p{Lu}
| \p{Ll}
| \p{Lt}
| \p{Lm}
| \p{Lo}
| \p{Nl}
etc...
We could also use them to do the following:
- Define characters in terms of negations (ex: "TOKEN: \p{^Lu} |
\p{^Letter};)
- Match letters including diacritics (ex: LETTER_AND_DIACRITIC: \p{L}
\p{M}*)
That's it for me. Thanks for reading this far.
Darryl
More information about the antlr-interest
mailing list