[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)
Gavin Lambert
antlr at mirality.co.nz
Sat Jul 5 20:21:07 PDT 2008
At 14:41 6/07/2008, Joe wrote:
>But what about characters outside the BMP? For example how
>would I match the CJK UNIFIED IDEOGRAPH range
>(U+20000..U+2A6D6)?
>Individually splitting them into two 16-bit characters is not
>a viable solution.
That one, I'm less sure about, and I think the answer depends on
your target language.
For example, Java uses UTF-16 internally, so I think you do
indeed have to split them into two 16-bit characters (a surrogate
pair), because that's how Java encodes them.
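To make the split concrete, here is a rough Java sketch of how the
two ends of that range come out as surrogate pairs, and how a range
test over 16-bit units might look. It is not ANTLR-specific: the
class SurrogateDemo and its method names are made up for
illustration, and only Character.toChars comes from the standard
library.

```java
// Illustrative only: SurrogateDemo and its methods are hypothetical
// names for this sketch and are not part of ANTLR.
class SurrogateDemo {
    // High (leading) 16-bit unit of a supplementary code point.
    static int high(int codePoint) {
        return Character.toChars(codePoint)[0];
    }

    // Low (trailing) 16-bit unit of a supplementary code point.
    static int low(int codePoint) {
        return Character.toChars(codePoint)[1];
    }

    // True if the UTF-16 pair (hi, lo) encodes a code point in
    // U+20000..U+2A6D6. The range breaks into two pieces:
    //   hi in D840..D868 with any low surrogate DC00..DFFF
    //   hi == D869       with lo in DC00..DED6
    static boolean inRange(char hi, char lo) {
        if (hi >= 0xD840 && hi <= 0xD868) {
            return lo >= 0xDC00 && lo <= 0xDFFF;
        }
        if (hi == 0xD869) {
            return lo >= 0xDC00 && lo <= 0xDED6;
        }
        return false;
    }

    public static void main(String[] args) {
        // Prints: U+20000 -> D840 DC00
        System.out.printf("U+20000 -> %04X %04X%n",
                high(0x20000), low(0x20000));
        // Prints: U+2A6D6 -> D869 DED6
        System.out.printf("U+2A6D6 -> %04X %04X%n",
                high(0x2A6D6), low(0x2A6D6));
    }
}
```

A lexer rule over 16-bit characters would presumably need the same
two-piece shape as inRange, but I haven't tried writing one.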
The C target, on the other hand, uses UTF-32, so I think you
wouldn't need to do that there. I'm not sure how to express such a
character in the grammar, though; I've never needed to.
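For comparison, when the input is seen as whole 32-bit code points
the range test collapses to a single comparison. The sketch below
shows that idea in Java rather than C, and makes no claim about how
the C runtime actually does it. CodePointRangeDemo and its methods
are made-up names; codePointAt and Character.charCount are standard
library calls.

```java
// Illustrative only: CodePointRangeDemo is a hypothetical name for
// this sketch and is not part of ANTLR or its C runtime.
class CodePointRangeDemo {
    static final int FIRST = 0x20000;
    static final int LAST = 0x2A6D6;

    // With full code points, the whole range is one comparison.
    static boolean inRange(int codePoint) {
        return codePoint >= FIRST && codePoint <= LAST;
    }

    // Walks a Java string one code point at a time (not one 16-bit
    // char at a time) and counts the code points that fall in range.
    static int countInRange(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (inRange(cp)) {
                count++;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for a pair
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "a" + new String(Character.toChars(0x20000)) + "b";
        // Prints: 1
        System.out.println(countInRange(sample));
    }
}
```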
(And you should try to keep responses on the list -- that way
other people can chime in if they have a better answer. Use your
Reply All button.)