[antlr-interest] Recognizing 5-th hex digit

Wed Aug 26 16:29:36 PDT 2009

Gavin Lambert wrote:
> At 07:35 27/08/2009, Kieran Beltran wrote:
>> I have encountered a problem when attempting to recognize two 
>> required Standard Z symbols which are "above" the four-hex set 
>> recognized by my generated lexer. The two symbols are \u1D538 and 
>> \u1D53D.
> [...]
>> Is the solution to include a fifth digit to be recognized 
>> optionally? Could I simply replace line 495 (as below) and add a 
>> new fragment
>>
>> 'u' ZDIGIT? XDIGIT XDIGIT XDIGIT XDIGIT
> 
> No.  It also depends on the stream encoding.  IIRC the Java target 
> at least reads in files as UTF-16.  So there's no "room" in a 
> single character to store that single digit.
> 
> Instead, you need to encode it as a surrogate pair. \u1D538, for 
> example, would be encoded as \uD835\uDD38.

I believe this is correct. Java's support beyond the BMP is confusing 
and somewhat patchy. Sometimes 'character' means a code point (a full 
UCS character, which needs up to 21 bits, i.e. 4 bytes in UTF-32) and 
sometimes, as in the char datatype, it means a 'code unit', a 16-bit 
piece of UTF-16. Certainly anything that is ever going to be compared 
against the value of a char must be a code unit, not a code point, hence 
the surrogate pairs. This is awkward, but there's no sane way around it.
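
For what it's worth, the surrogate pair for a given code point can be 
worked out with the standard Character.toChars method. The following is 
only a minimal sketch; the class name SurrogateDemo and the helper 
toEscapes are made up for illustration and have nothing to do with 
ANTLR itself.

```java
public class SurrogateDemo {
    // Hypothetical helper: renders a code point as the sequence of
    // four-hex-digit escapes for its UTF-16 code units.
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one char for BMP code points and
        // a high/low surrogate pair for anything above U+FFFF.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two Standard Z symbols from the original question.
        System.out.println(toEscapes(0x1D538)); // code units D835, DD38
        System.out.println(toEscapes(0x1D53D)); // code units D835, DD3D
    }
}
```

The resulting escapes are what would go into the lexer rule in place of 
the five-digit form, assuming the lexer really does see UTF-16 code 
units as described above.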

> I'm not entirely sure how it works in the C target, which uses 
> UTF-32 encoding by default; I've never really needed to use 
> characters that high up.

There may be a problem in that case in the Java code used to generate 
the C, but I'm not sure. I can see how there could be. If, however, 
you're transcoding input from whatever it is (UTF-8, UTF-16, something 
from ISO-2022, whatever) to UTF-32, surrogate pairs are likely *not* to 
work: UTF-32 stores each code point as a single 32-bit unit, so the 
surrogates never appear in the resulting stream.
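
To illustrate the difference from the Java side (I can't speak for what 
the C runtime actually does), here is a rough sketch that views the same 
string first as UTF-16 code units and then as code points, which is 
what a UTF-32 stream would carry. The class name CodePointDemo and the 
helper codePointsOf are invented for the example, and String.codePoints 
needs Java 8 or later.

```java
public class CodePointDemo {
    // Hypothetical helper: the string as a sequence of code points,
    // i.e. the values a UTF-32 stream would contain.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // Build the string from the code point, so no escapes are needed.
        String s = new String(Character.toChars(0x1D538));

        // As UTF-16 there are two code units (the surrogate pair).
        System.out.println("UTF-16 code units: " + s.length());

        // As code points there is a single value and no surrogates.
        int[] cps = codePointsOf(s);
        System.out.println("code points: " + cps.length);
        System.out.println("value: U+" + Integer.toHexString(cps[0]).toUpperCase());
    }
}
```

So a rule written in terms of the surrogate pair has nothing to match 
against once the input has been turned into code points.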

Well, that was a random outpouring...

-- 
Sam Barnett-Cormack