[antlr-interest] Recognizing 5-th hex digit

Gavin Lambert antlr at mirality.co.nz
Wed Aug 26 14:16:10 PDT 2009


At 07:35 27/08/2009, Kieran Beltran wrote:
>I have encountered a problem when attempting to recognize two 
>required Standard Z symbols which are "above" the four-hex set 
>recognized by my generated lexer. The two symbols are \u1D538 and 
>\u1D53D.
[...]
>Is the solution to include a fifth digit to be recognized 
>optionally? Could I simply replace line 495 (as below) and add a 
>new fragment
>
>'u' ZDIGIT? XDIGIT XDIGIT XDIGIT XDIGIT

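No.  It also depends on the stream encoding.  IIRC the Java target 
at least holds its input as UTF-16, so each character the lexer sees 
is a single 16-bit code unit, and a four-digit \uXXXX escape already 
fills it.  There's no "room" in a single character to store a fifth 
digit.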

Instead, you need to encode it as a surrogate pair, i.e. two 
consecutive 16-bit characters. \u1D538, for example, would be 
encoded as \uD835\uDD38.
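
If it helps, here is a minimal Java sketch of the arithmetic behind that 
pair. The class and method names (SurrogatePairDemo, toEscapes) are made 
up for illustration and are not part of ANTLR; the only library call 
used for the cross-check is the standard Character.toChars.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: renders a code point as four-hex-digit
    // escapes, one per UTF-16 code unit.
    public static String toEscapes(int codePoint) {
        if (codePoint < 0x10000) {
            // Fits in one 16-bit unit; no surrogates needed.
            return String.format("\\u%04X", codePoint);
        }
        int offset = codePoint - 0x10000;     // 20 bits remain
        int high = 0xD800 + (offset >> 10);   // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
        return String.format("\\u%04X\\u%04X", high, low);
    }

    public static void main(String[] args) {
        // The two code points from the question.
        System.out.println(toEscapes(0x1D538)); // prints \uD835\uDD38
        System.out.println(toEscapes(0x1D53D)); // prints \uD835\uDD3D

        // Cross-check the manual arithmetic against the JDK.
        char[] units = Character.toChars(0x1D538);
        System.out.println(units.length == 2
                && units[0] == 0xD835 && units[1] == 0xDD38); // prints true
    }
}
```

So in the grammar you would presumably match the two four-digit escapes 
in sequence rather than a single five-digit one.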


I'm not entirely sure how it works in the C target, which uses 
UTF-32 encoding by default; I've never really needed to use 
characters that high up.


