[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)

Sat Jul 5 20:35:10 PDT 2008

Gavin Lambert wrote:
> At 14:41 6/07/2008, Joe wrote:
> >But what about characters outside the BMP? For example how
> >would I match the CJK UNIFIED IDEOGRAPH range (U+20000..U+2A6D6)?
> >Individually splitting them into two 16-bit characters is not
> >a viable solution.
>
> That one, I'm less sure about, and I think the answer depends on your 
> target language.
>
> For example, I think Java uses UTF-16, which means that you do indeed 
> have to split them into two 16-bit characters (because that's how it 
> encodes them).
>
> Whereas the C target uses UTF-32, so I think you wouldn't need to do 
> that.  I'm not sure how to express the character in the grammar, 
> though -- I've never needed to do that.
>
> (And you should try to keep responses on the list -- that way other 
> people can chime in if they have a better answer.  Use your Reply All 
> button.)
>
Yes, in UTF-16 a character outside the BMP takes two 16-bit code units, 
but the decoding should not have to be in the grammar (which would be 
extremely impractical). It should be taken care of internally. As far as 
I can tell, this would require changing the LA method (which already 
returns an int) and adding another escape sequence for characters in the 
range from U+10000 to U+10FFFF (I may be wrong though).
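
To make the idea concrete, here is a rough sketch of what "taken care of 
internally" could look like. It is not ANTLR code: the function names 
(to_surrogates, la) and the list-of-code-units input are assumptions 
made up for illustration. It shows the standard UTF-16 surrogate 
arithmetic and a lookahead that hands back a whole code point as an int.

```python
# Illustrative sketch only; names and structure are hypothetical, not ANTLR's API.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogate range


def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def la(units, index):
    """Look at position `index` in a list of 16-bit code units.

    Returns (code_point, units_consumed). A valid surrogate pair is
    combined into one code point; anything else is returned unchanged.
    Returns (-1, 0) at end of input.
    """
    if index >= len(units):
        return -1, 0
    unit = units[index]
    if HIGH_START <= unit <= HIGH_END and index + 1 < len(units):
        low = units[index + 1]
        if LOW_START <= low <= LOW_END:
            cp = 0x10000 + ((unit - HIGH_START) << 10) + (low - LOW_START)
            return cp, 2
    return unit, 1


# The endpoints of the range asked about, as UTF-16 code units:
first_pair = to_surrogates(0x20000)
last_pair = to_surrogates(0x2A6D6)

# A stream containing 'A' followed by U+20000:
stream = [0x41, first_pair[0], first_pair[1]]
code_point, consumed = la(stream, 1)
```

With a lookahead like this, a lexer rule could compare against the range 
0x20000..0x2A6D6 directly instead of spelling out surrogate pairs in the 
grammar.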

-- 
Generally speaking, things have gone about as far
as they can possibly go, when things have gotten
about as bad as they can reasonably get.