[antlr-interest] Unicode Category Question

Darryl A. J. Staflund darryl.aj.staflund at shaw.ca
Sat Mar 22 13:52:39 PDT 2008


Hi everyone,

I am a new ANTLR user and have started to write an ECMA-compliant lexer 
for C# using ANTLR 3.0.1.  I know that other C# lexers exist on the site 
but I want to try writing one for myself to get a feel for ANTLR and to 
learn how to use it to deal with pre-processing directives, etc.  Since 
I hope to compile the lexer using C# instead of Java, I have set my 
target language as 'CSharp' in the options of my grammar file.  I then 
generate the source code using the Java-based 'org.antlr.Tool' class.  I 
am using the Sun Java SDK 1.6.0_05.

I have run into two difficulties with this current approach:

1.   The ECMA 334 specification defines a C# 2.x Unicode escape sequence 
as follows:

        unicode-escape-sequence::
            \u hex-digit hex-digit hex-digit hex-digit
            \U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

Although the Java Language Specification handles the first form just 
fine, it has no equivalent of the second; supplementary characters are 
represented differently, as explained in 
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1:

    "The Unicode standard was originally designed as a fixed-width 
16-bit character encoding. It
    has since been changed to allow for characters whose representation 
requires more than 16 bits.
    The range of legal code points is now U+0000 to U+10FFFF, using the 
hexadecimal U+n notation.
    Characters whose code points are greater than U+FFFF are called 
supplementary characters. To
    represent the complete range of characters using only 16-bit units, 
the Unicode standard defines an
    encoding called UTF-16. In this encoding, supplementary characters 
are represented as pairs of
    16-bit code units, the first from the high-surrogates range, (U+D800 
to U+DBFF), the second from
    the low-surrogates range (U+DC00 to U+DFFF). For characters in the 
range U+0000 to U+FFFF,
    the values of code points and UTF-16 code units are the same."

Since Java represents supplementary characters as pairs of 16-bit code 
units rather than as single '\Uxxxxxxxx' escape sequences, I don't know 
how to match the latter in my ANTLR grammar.  Does ANTLR's Java-based 
lexer handle Unicode supplementary characters passed to it in the 
'\Uxxxxxxxx' format?  How should I handle something like this?
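
For what it's worth, here is how I understand the mapping works on the 
Java side (just a sketch; the example code point is arbitrary):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF), written \U0001D11E in C#
            int codePoint = 0x1D11E;
            // Split the code point into its UTF-16 surrogate pair by hand...
            char high = (char) (0xD800 + ((codePoint - 0x10000) >> 10));
            char low  = (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
            // ...or let the library do it:
            char[] pair = Character.toChars(codePoint);
            System.out.printf("%04X %04X%n", (int) high, (int) low);  // D834 DD1E
        }
    }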

2.  The ECMA 334 specification defines identifiers in terms of Unicode 
character categories as follows:

    letter-character::
    A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
    A unicode-escape-sequence representing a character of classes Lu, 
    Ll, Lt, Lm, Lo, or Nl

    combining-character::
    A Unicode character of classes Mn or Mc
    A unicode-escape-sequence representing a character of classes Mn or Mc

    decimal-digit-character::
    A Unicode character of the class Nd
    A unicode-escape-sequence representing a character of the class Nd

    connecting-character::
    A Unicode character of the class Pc
    A unicode-escape-sequence representing a character of the class Pc

    formatting-character::
    A Unicode character of the class Cf
    A unicode-escape-sequence representing a character of the class Cf

I have read various posts in the ANTLR newsgroups on how to recognize 
these categories and see that a number of strategies have been discussed:

- Define the categories in terms of Unicode escape sequences, e.g. 
"WHITESPACE:  ('\u0020' | ('\u2000'..'\u200A') | ...etc...);".
- Define the categories in terms of semantic predicates, e.g. 
"WHITESPACE:  {IsUnicodeCategoryZs(input.LA(1))}? . ;".

The difficulty with the first approach, as far as the C# specification 
is concerned, is that some Unicode character categories (e.g. Nd) 
contain supplementary characters.  This means that in order to represent 
them in a Java-friendly manner, I would have to expand every such 
character into a surrogate pair.  Ugghh!  I don't mind the second 
option, although it sounds as though it will run slower.
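
If I went with the second option, I imagine the helper would look 
something like the following in Java terms (a sketch only; the C# 
target would spell it IsUnicodeCategoryZs and presumably use 
System.Globalization.CharUnicodeInfo instead):

    // Character.getType() reports the Unicode general category and
    // accepts an int code point, so supplementary characters are
    // covered as well.
    static boolean isUnicodeCategoryZs(int c) {
        return Character.getType(c) == Character.SPACE_SEPARATOR;  // Zs
    }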

Since the newer regular expression engines used by Perl, Java, C#, etc. 
have been built to match on Unicode, could ANTLR's EBNF be extended to 
match on the following:

- Unicode Character Properties (e.g. \p{Lu}, \p{Mn}, \p{Currency_Symbol}, ...)
- Unicode Scripts (e.g. \p{Common}, \p{Arabic}, ...)
- Unicode Blocks (e.g. \p{InBasic_Latin}, \p{InGreek}, ...; see the 
example just below)
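
In fact, Java's own java.util.regex engine already accepts the category 
syntax.  A quick illustration:

    import java.util.regex.Pattern;

    public class CategoryDemo {
        public static void main(String[] args) {
            // \p{Lu} is the Unicode 'uppercase letter' general category
            Pattern upper = Pattern.compile("\\p{Lu}");
            System.out.println(upper.matcher("\u00C9").matches());  // true: U+00C9 is Lu
        }
    }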


If this were done, we could use them in an ANTLR grammar like so:

    letter-character
        :    \p{Lu}
        |    \p{Ll}
        |    \p{Lt}
        |    \p{Lm}
        |    \p{Lo}
        |    \p{Nl}
        ;

etc...

We could also use them to do the following:

- Define tokens in terms of negations (ex:  "TOKEN:  \p{^Lu} | 
\p{^Letter};")
- Match letters together with their diacritics (ex:  
"LETTER_AND_DIACRITIC:  \p{L} \p{M}*;") -- see the sketch below
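
The letter-plus-diacritics match already works in Java's regex engine, 
so the semantics are well defined.  For example:

    import java.util.regex.Pattern;

    public class DiacriticDemo {
        public static void main(String[] args) {
            // A base letter followed by any number of combining marks
            Pattern p = Pattern.compile("\\p{L}\\p{M}*");
            // 'e' followed by U+0301 COMBINING ACUTE ACCENT
            System.out.println(p.matcher("e\u0301").matches());  // true
        }
    }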

That's it for me.  Thanks for reading this far.

Darryl

