[antlr-interest] Recognizing 5-th hex digit

Kieran Beltran kieran.beltran at gmail.com
Wed Aug 26 17:01:30 PDT 2009


Sam / Gavin, thank you.

So, if I am receiving UTF-32 input, I would need to preprocess it
(using the UTF-32 --> UTF-16 algorithm), converting characters in the
10000 to 10FFFF range into surrogate pairs, and then pass that input to
ANTLRInputStream.
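
A minimal sketch of that preprocessing step, assuming the UTF-32 input has
already been decoded into an array of int code points. The class and method
names (Utf32ToUtf16, toUtf16) are hypothetical and only for illustration;
StringBuilder.appendCodePoint is the standard Java call that emits a
surrogate pair for anything above FFFF.

```java
public class Utf32ToUtf16 {
    // Hypothetical helper: turns UTF-32 code points into a Java (UTF-16) String.
    // Code points above U+FFFF come out as a high/low surrogate pair.
    static String toUtf16(int[] codePoints) {
        StringBuilder sb = new StringBuilder(codePoints.length);
        for (int cp : codePoints) {
            // Throws IllegalArgumentException if cp is not a valid code point.
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two Standard Z symbols discussed in this thread.
        String s = toUtf16(new int[] { 0x1D538, 0x1D53D });
        // Print each UTF-16 code unit in hex.
        for (char c : s.toCharArray()) {
            System.out.printf("%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The resulting String (or its bytes in a UTF-16 encoding) is what would then
be handed to the ANTLR input stream; exactly how to wire that up is not shown
here.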

In my lexer definition, where appropriate, I would define the tokens to
recognize the surrogate pairs, for example:

fragment ARITHMOS: '\uD835\uDD38'; // recognize UTF-32 (0001 D538) arithmos
fragment FINSET: '\uD835\uDD3D';   // recognize UTF-32 (0001 D53D) finite set
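
As a sanity check on the two pairs above, the standard UTF-16 arithmetic
(subtract 10000, split the remaining 20 bits into two 10-bit halves, add D800
and DC00) can be written out by hand. This is only a sketch; SurrogateCheck
and toSurrogates are hypothetical names.

```java
public class SurrogateCheck {
    // Hypothetical helper: computes the UTF-16 surrogate pair for a
    // supplementary code point (U+10000..U+10FFFF) by hand.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x1D538, 0x1D53D }) {
            char[] pair = toSurrogates(cp);
            System.out.printf("U+%05X -> \\u%04X\\u%04X%n",
                    cp, (int) pair[0], (int) pair[1]);
        }
    }
}
```

For 1D538 this gives D835 DD38, and for 1D53D it gives D835 DD3D, matching
the literals in the fragments.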

As indicated, this is only for Java targets.

Have I got it right?

-Kieran




On Wed, Aug 26, 2009 at 7:29 PM, Sam Barnett-Cormack <
s.barnett-cormack at lancaster.ac.uk> wrote:

> Gavin Lambert wrote:
>
>> At 07:35 27/08/2009, Kieran Beltran wrote:
>>
>>> I have encountered a problem when attempting to recognize two required
>>> Standard Z symbols which are "above" the four-hex set recognized by my
>>> generated lexer. The two symbols are \u1D538 and \u1D53D.
>>>
>> [...]
>>
>>> Is the solution to include a fifth digit to be recognized optionally?
>>> Could I simply replace line 495 (as below) and add a new fragment
>>>
>>> 'u' ZDIGIT? XDIGIT XDIGIT XDIGIT XDIGIT
>>>
>>
>> No.  It also depends on the stream encoding.  IIRC the Java target at
>> least reads in files as UTF-16.  So there's no "room" in a single character
>> to store that single digit.
>>
>> Instead, you need to encode it as a surrogate pair. \u1D538, for example,
>> would be encoded as \uD835\uDD38.
>>
>
> I believe this is correct - Java's support beyond the BMP is confusing and
> somewhat patchy. Sometimes 'character' means a code point (a full UCS
> character, needing 4 bytes to fully specify) and sometimes, as in the char
> datatype, it means a 'code unit', a piece of UTF-16. Certainly anything that
> is ever going to be used to check the value of a char must be a code unit,
> not a code point, hence using surrogate pairs. This is awkward, but there's
> no sane way to get around it.
>
>> I'm not entirely sure how it works in the C target, which uses UTF-32
>> encoding by default; I've never really needed to use characters that high
>> up.
>>
>
> There may be a problem in that case in the java code used to generate the
> C, but I'm not sure. I can see how there could be. If, however, you're
> transcoding input from whatever it is (UTF-8, UTF-16, something from
> ISO-2022, whatever) to UTF-32, surrogate pairs are likely to *not* work, as
> they aren't present in the resulting byte stream.
>
> Well, that was a random outpouring...
>
> --
> Sam Barnett-Cormack
>



-- 
Respectfully,
Kieran J. Beltran

Phone: (646) 294-7102
Web Page: www.lostrivercreations.com/kieran