[antlr-interest] Recognizing 5-th hex digit

David-Sarah Hopwood david-sarah at jacaranda.org
Wed Aug 26 17:16:41 PDT 2009


Kieran Beltran wrote:
> I am working on an ANTLR grammar to support the ISO Standard Z notation
> (specification language). The Z character set includes many non-ASCII
> characters, so the lexer must recognize unicode character sequences, which,
> for lexer token definitions comprising 4-hex escaped unicode (\uxxxx), I
> believe ANTLR works fine.
> 
> I have encountered a problem when attempting to recognize two required
> Standard Z symbols which are "above" the four-hex set recognized by my
> generated lexer. The two symbols are \u1D538 and \u1D53D.

> A review of the UCS documentation
> http://unicode.org/Public/UNIDATA/UnicodeData.txt indicates that indeed
> there is a 5-th hex digit that is used "publicly", albeit infrequently -
> primarily for mathematics, musical symbols and other areas.

Strictly speaking, the code point range goes up to U+10FFFF (not all of
which are assigned characters). The \u notation isn't typically used for
code points above U+FFFF, because it would be ambiguous with a four-digit
escape followed by an unescaped hex digit.
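To illustrate the ambiguity, here is a small sketch using Java's own string
escapes, which also take exactly four hex digits. The class and method names
are made up for the example, and it only shows Java's behaviour, not ANTLR's:

```java
public class EscapeAmbiguity {
    // Returns the string that Java builds from a five-hex-digit escape text.
    static String fiveDigitEscape() {
        // Only the first four hex digits belong to the escape; the trailing
        // '8' is an ordinary character.
        return "\u1D538";
    }

    public static void main(String[] args) {
        String s = fiveDigitEscape();
        System.out.println(s.length());                        // prints 2
        System.out.println(Integer.toHexString(s.charAt(0)));  // prints 1d53
        System.out.println(s.charAt(1));                       // prints 8
    }
}
```

So the result is U+1D53 followed by the digit 8, not the single character
U+1D538.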

> Not sure many
> folks are writing grammars requiring recognition of such character sets.
> Interestingly, the 5-th hex digit only needs to reach E as the highest UCS
> symbol that might be used publically appears to currently be \uE01EF. Above
> F0000 appears to be for private use only.
> 
> Looking at the ANTLRv3.g grammar within the ESC fragment definition, I
> believe that the four-hex unicode definition is defined:
> see line 495        'u' XDIGIT XDIGIT XDIGIT XDIGIT

You can match such characters without changes to ANTLR by converting the
input to UTF-16 (for example by using java.io.InputStreamReader and
ANTLRReaderStream, if the target language is Java), and matching their
UTF-16 encodings. In this case that would be

  U+1D538  '\uD835\uDD38'
  U+1D53D  '\uD835\uDD3D'
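If you would rather compute these pairs than look them up, something like the
following should do it. This is only a sketch: the class and method names are
hypothetical, and it relies on the standard java.lang.Character.toChars to
split a code point into its UTF-16 code units.

```java
public class SurrogatePairs {
    // Formats a code point as the four-hex-digit escapes of its UTF-16
    // code units (one escape for BMP characters, two for supplementary ones).
    static String toUtf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Escapes(0x1D538));
        System.out.println(toUtf16Escapes(0x1D53D));
    }
}
```

Running it prints the two escape sequences listed above, which can then be
pasted into the lexer rule as quoted literals.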

Note that this relies on the fact that ANTLRStringStream and its subclasses
do not convert from UTF-16 to code points, even though ANTLR uses integer
code point streams internally. You may therefore have to change your grammar
if and when ANTLR fully supports supplementary characters, although I don't
see any easy way around that.
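As a rough check of what the lexer will see, the sketch below reads U+1D538
through a java.io.InputStreamReader and prints each value the Reader returns.
It assumes the input file is UTF-8 (substitute your actual encoding), uses
hypothetical class and method names, and leaves ANTLR out entirely so that it
runs on its own. In the real setup the Reader would be handed to
ANTLRReaderStream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderCodeUnits {
    // Decodes UTF-8 bytes through a Reader and returns each value it
    // delivers as four hex digits, separated by spaces.
    static String readCodeUnitsHex(byte[] utf8Bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int unit;
            while ((unit = reader.read()) != -1) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%04X", unit));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the UTF-8 encoding of U+1D538 to stand in for file contents.
        byte[] bytes = new String(Character.toChars(0x1D538))
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readCodeUnitsHex(bytes)); // prints D835 DD38
    }
}
```

The Reader hands back two UTF-16 code units rather than one code point, which
is why the two-escape literals match.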

See <http://people.w3.org/rishida/scripts/uniview/conversion> for an on-line
converter between various escape formats. Here you want the output labelled
as "JavaScript escapes".

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com


