[antlr-interest] Unicode Category Question

Johannes Luber jaluber at gmx.de
Mon Mar 24 12:27:57 PDT 2008


Darryl A. J. Staflund wrote:
> Hi everyone,
> 
> I am a new ANTLR user and have started to write an ECMA-compliant lexer 
> for C# using ANTLR 3.0.1.  I know that other C# lexers exist on the site 
> but I want to try writing one for myself to get a feel for ANTLR and to 
> learn how to use it to deal with pre-processing directives, etc.  Since 
> I hope to compile the lexer using C# instead of Java, I have set my 
> target language as 'CSharp' in the options of my grammar file.  I then 
> generate the source code using the Java-based 'org.antlr.Tool' class.  I 
> am using the Sun Java SDK 1.6.0_05.
> 
> I have run into two difficulties with this current approach:
> 
> 1.   The ECMA 334 specification defines a C# 2.x Unicode escape sequence 
> as follows:
> 
>        unicode-escape-sequence::
>            \u hex-digit hex-digit hex-digit hex-digit
>            \U hex-digit hex-digit hex-digit hex-digit
>               hex-digit hex-digit hex-digit hex-digit
> 
> Although the current (?) Java specification handles the first option 
> just fine, it handles the second option a bit differently as stated in 
> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1:
> 
>    "The Unicode standard was originally designed as a fixed-width 16-bit 
> character encoding. It
>    has since been changed to allow for characters whose representation 
> requires more than 16 bits.
>    The range of legal code points is now U+0000 to U+10FFFF, using the 
> hexadecimal U+n notation.
>    Characters whose code points are greater than U+FFFF are called 
> supplementary characters. To
>    represent the complete range of characters using only 16-bit units, 
> the Unicode standard defines an
>    encoding called UTF-16. In this encoding, supplementary characters 
> are represented as pairs of
>    16-bit code units, the first from the high-surrogates range, (U+D800 
> to U+DBFF), the second from
>    the low-surrogates range (U+DC00 to U+DFFF). For characters in the 
> range U+0000 to U+FFFF,
>    the values of code points and UTF-16 code units are the same."
> 
> Since Java represents supplementary characters as pairs of 16-bit code 
> units instead of as single '\Uxxxxxxxx' string sequences, I don't know 
> how to parse these latter values in my ANTLR grammar.  Does ANTLR's 
> Java-based lexer handle Unicode supplementary characters passed to it in 
> the '\Uxxxxxxxx' format?  How should I handle something like this?

I refer for these questions to my dissertation, which I will send to 
you (and anyone else) off-list.
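
In the meantime, the core of the issue can be shown directly: Java's 
standard Character API converts a '\Uxxxxxxxx'-style code point into 
the UTF-16 code units a Java-based lexer actually sees. The following 
is only a sketch of that conversion (the helper name decodeLongEscape 
is made up for illustration), not a description of how ANTLR itself 
handles the escape:

public class SupplementaryDemo {
    // Hypothetical helper: takes the eight hex digits after "\U".
    static char[] decodeLongEscape(String eightHexDigits) {
        int codePoint = Integer.parseInt(eightHexDigits, 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("bad code point: " + eightHexDigits);
        }
        // Returns a single char for BMP code points, and a high/low
        // surrogate pair for supplementary ones (above U+FFFF).
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) is supplementary.
        for (char c : decodeLongEscape("0001D49C")) {
            System.out.printf("U+%04X%n", (int) c); // U+D835, then U+DC9C
        }
    }
}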

> 2.  The ECMA 334 specification defines identifiers in terms of Unicode 
> character categories as follows:
> 
>    letter-character::
>    A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
>    A unicode-escape-sequence representing a character of classes Lu, Ll, 
> Lt, Lm, Lo, or Nl
> 
>    combining-character::
>    A Unicode character of classes Mn or Mc
>    A unicode-escape-sequence representing a character of classes Mn or Mc
> 
>    decimal-digit-character::
>    A Unicode character of the class Nd
>    A unicode-escape-sequence representing a character of the class Nd
> 
>    connecting-character::
>    A Unicode character of the class Pc
>    A unicode-escape-sequence representing a character of the class Pc
> 
>    formatting-character::
>    A Unicode character of the class Cf
>    A unicode-escape-sequence representing a character of the class Cf
> 
> I have read various posts in the ANTLR newsgroups on how to recognize 
> these categories and see that a number of strategies have been discussed:
> 
> - Define the categories in terms of Unicode escape sequences, eg.  
> "WHITESPACE:  ('\u0020' | ('\u2000'.'\u0200A') | ...etc...);".
> - Define the categories in terms of semantic predicates, eg.  
> "WHITESPACE:  { IsUnicodeCategoryZs (LA (1)) }? ;".
> 
> The difficulty with the first approach in regard to the C# specification 
> is that some Unicode character categories (ex:  Nd) contain 
> supplementary characters.  This means that in order to represent them in 
> a Java-friendly manner, I need to convert them into pairs of Unicode 
> characters.  Ugghh!  I don't mind the second option, although it sounds 
> as though it will run slower.
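
For concreteness, the predicate strategy could be backed by a helper 
along these lines. Character.getType works on full code points, so 
surrogate pairs are handled for free; the helper name and the way it 
would be wired into a grammar are illustrative, not part of any ANTLR 
runtime:

public class CategoryPredicate {
    // True if the code point starting at 'index' is a decimal
    // digit (Unicode general category Nd).
    static boolean isUnicodeCategoryNd(CharSequence input, int index) {
        int cp = Character.codePointAt(input, index); // joins surrogate pairs
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        // U+104A0 (OSMANYA DIGIT ZERO) is a supplementary Nd character.
        String osmanya = new String(Character.toChars(0x104A0));
        System.out.println(isUnicodeCategoryNd(osmanya, 0)); // true
        System.out.println(isUnicodeCategoryNd("A", 0));     // false
    }
}
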
> 
> Since the newer regular expression engines used by Perl, Java, C#, etc. 
> have been built to match on Unicode, could ANTLR's EBNF be extended to 
> match on the following:
> 
> - Unicode Character Properties (e.g.  \p{Lu}, \p{Mn}, etc...)
> - Unicode Scripts (e.g.  \p{Common}, \p{Arabic}, etc...)
> - Unicode Blocks (e.g. \p{InCurrency_Symbols}, \p{InBasic_Latin}, ...)
> 
> 
> If this were done, we could use them in the ANTLR parser like so:
> 
>    letter-character
>        :    \p{Lu}
>        |    \p{Ll}
>        |    \p{Lt}
>        |    \p{Lm}
>        |    \p{Lo}
>        |    \p{Nl}
>        ;
> etc...
> 
> We could also use them to do the following:
> 
> - Define characters in terms of negations (ex:  "TOKEN:  \p{^Lu} | 
> \p{^Letter};")
> - Match letters including diacritics (ex:  LETTER_AND_DIACRITIC:  \p{L} 
> \p{M}*)

An interesting enhancement, but unnecessary for the problem, as 
described in my dissertation.
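
For what it's worth, Java's regular expression engine already 
understands the category and block forms of this notation, so the 
idea can be tried outside ANTLR. A quick, self-contained demonstration 
of the letter-plus-diacritics idiom from the last example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PropertyDemo {
    public static void main(String[] args) {
        // \p{L} = any letter, \p{M}* = any run of combining marks.
        Pattern letterAndMarks = Pattern.compile("\\p{L}\\p{M}*");
        Matcher m = letterAndMarks.matcher("e\u0301");  // 'e' + COMBINING ACUTE ACCENT
        System.out.println(m.matches());                // true: letter plus diacritic
        System.out.println(Pattern.matches("\\p{Lu}", "a")); // false: not uppercase
    }
}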

Johannes

