[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)

Sat Jul 5 20:21:07 PDT 2008

At 14:41 6/07/2008, Joe wrote:
 >But what about characters outside the BMP? For example how
 >would I match the CJK UNIFIED IDEOGRAPH range
 >(U+20000..U+2A6D6)?
 >Individually splitting them into two 16-bit characters is not
 >a viable solution.

That one, I'm less sure about, and I think the answer depends on 
your target language.

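For example, Java uses UTF-16 for its char and String types, so 
a character outside the BMP is stored as a surrogate pair.  With 
the Java target, I think that means you do indeed have to split 
them into two 16-bit characters (because that's how it encodes 
them).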
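
To make the splitting concrete, here is a small sketch in plain 
Java (not ANTLR grammar syntax) that prints the UTF-16 code units 
for the two endpoints of your range.  The class and method names 
are made up for illustration; Character.toChars is the standard 
JDK call that does the actual conversion.

```java
final class SurrogateSplit {
    // Returns the UTF-16 code units of one code point as
    // space-separated four-digit hex values.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First and last code points of the range in the question.
        System.out.println(utf16Units(0x20000)); // D840 DC00
        System.out.println(utf16Units(0x2A6D6)); // D869 DED6
    }
}
```

So in UTF-16 terms the range runs from the pair D840 DC00 up to 
the pair D869 DED6.  A rule matching it would presumably need one 
alternative for high surrogates D840..D868 followed by any low 
surrogate (DC00..DFFF), and another for D869 followed by 
DC00..DED6, but I haven't tried that in a grammar.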

Whereas the C target uses UTF-32, so I think you wouldn't need to 
do that.  I'm not sure how to express the character in the 
grammar, though -- I've never needed to do that.
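
For comparison, this is what the check looks like once you are 
working with whole code points rather than 16-bit units.  It is 
only an illustration in plain Java (not the C runtime, and not 
grammar syntax), and the class and method names are made up; 
String.codePoints() is the standard JDK call (Java 8 and later) 
that joins surrogate pairs back into single int values.

```java
final class CodePointRange {
    // Endpoints of the range from the question.
    static final int RANGE_FIRST = 0x20000;
    static final int RANGE_LAST = 0x2A6D6;

    // With full code point values a single comparison covers the range.
    static boolean inRange(int codePoint) {
        return codePoint >= RANGE_FIRST && codePoint <= RANGE_LAST;
    }

    // Counts the code points of text that fall inside the range.
    static long countInRange(String text) {
        return text.codePoints().filter(CodePointRange::inRange).count();
    }

    public static void main(String[] args) {
        // One supplementary character followed by an ASCII letter.
        String text = new String(Character.toChars(0x20000)) + "a";
        System.out.println(text.length());      // 3 UTF-16 code units
        System.out.println(countInRange(text)); // 1
    }
}
```

If the input stream hands the lexer 32-bit values, the matching 
problem reduces to that one comparison, which is why I'd expect 
no splitting to be needed there.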

(And you should try to keep responses on the list -- that way 
other people can chime in if they have a better answer.  Use your 
Reply All button.)