[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)

Sun Jul 6 07:49:47 PDT 2008

Johannes Luber wrote:
> Joe schrieb:
>> Gavin Lambert wrote:
>>> At 14:41 6/07/2008, Joe wrote:
>>> >But what about characters outside the BMP? For example how
>>> >would I match the CJK UNIFIED IDEOGRAPH range (U+20000..U+2A6D6)?
>>> >Individually splitting them into two 16-bit characters is not
>>> >a viable solution.
>>>
>>> That one, I'm less sure about, and I think the answer depends on 
>>> your target language.
>>>
>>> For example, I think Java uses UTF-16, which means that you do 
>>> indeed have to split them into two 16-bit characters (because that's 
>>> how it encodes them).
>>>
>>> Whereas the C target uses UTF-32, so I think you wouldn't need to do 
>>> that.  I'm not sure how to express the character in the grammar, 
>>> though -- I've never needed to do that.
>>>
>>> (And you should try to keep responses on the list -- that way other 
>>> people can chime in if they have a better answer.  Use your Reply 
>>> All button.)
>>>
>> Yes, UTF-16 characters may be 2x16 bit, but the decoding should not 
>> have to be in the grammar (which would be extremely impractical). It 
>> should be taken care of internally. As far as I can tell, this would 
>> require changing the LA method (which already returns an int) and 
>> adding another escape sequence for characters in the range from 
>> U+10000 to U+10FFFF (I may be wrong though).
>>
>
> I have created a special rule to parse UTF-16 surrogate codepoints for 
> my own C# lexer:
>
> // This rule is supposed to catch all characters which may be used as
> // part of an identifier.
> // Note that this rule is a superset: it may include not only
> // positionally invalid characters, but also always-invalid characters.
> // Sort the bad identifiers out in the parser, where the symbol tables
> // are built. Digits aren't included because allowing them in first
> // position causes confusion with INTEGER_LITERAL and REAL_LITERAL.
> // The rule is structured to match the characters as they appear in
> // the UTF-16 encoding.
> //
> // This is a hack to work around the ANTLR 3 limitation that one can't
> // choose Unicode character classes directly. Characters known to be
> // required by other rules are also excluded.
> fragment ANY_UNUSED_CHARACTER
>     :    'A'..'Z'    // Use only alphabet characters below U+0080
>     |    'a'..'z'
>     |    '\u0080'..'\u009F'    // NO NO_BREAK SPACE
>     |    '\u00A1'..'\u167F'    // NO OGHAM SPACE MARK
>     |    '\u1681'..'\u180D'    // NO MONGOLIAN VOWEL SEPARATOR
>     |    '\u180F'..'\u1FFF'    // NO EN QUAD, EM QUAD, EN SPACE,
>                                // THREE_PER_EM SPACE, FOUR_PER_EM SPACE, SIX_PER_EM SPACE
>     |    '\u2007'        // NO PUNCTUATION SPACE, THIN SPACE, HAIR SPACE
>     |    '\u200B'..'\u202E'    // NO NARROW NO_BREAK SPACE
>     |    '\u2030'..'\u205E'    // NO MEDIUM MATHEMATICAL SPACE
>     |    '\u2060'..'\u2FFF'    // NO IDEOGRAPHIC SPACE
>     |    '\u3001'..'\uD7FF'
>     |    '\uE000'..'\uFFFE'
>     |    '\uD800'..'\uDBFF' '\uDC00'..'\uDFFF' // Surrogate code points
>     ;
>
> I've attached the whole file in case you want to look at the other rules.
>
> Johannes
That solves the problem of recognizing them as pairs that belong
together. Too bad it can't really replace the Unicode categories. Do
you think it would be possible to integrate a handwritten lexer using
ICU with an ANTLR-generated parser and tree parser? I couldn't find much
info on the C interface (wish there was a working C++ interface), so I'm
not sure if that is feasible.
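
For anyone following along, the arithmetic behind the surrogate-pair
alternative in the rule quoted above can be sketched in a few lines. This
is only an illustration in Python, not part of ANTLR or of Johannes's
lexer; the helper names to_surrogates and from_surrogates are made up for
this example. It shows which pairs of 16-bit units the CJK range
U+20000..U+2A6D6 turns into under UTF-16.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("U+%04X is not a supplementary code point" % cp)
    offset = cp - 0x10000            # 20-bit value left after removing the BMP
    high = 0xD800 + (offset >> 10)   # top 10 bits    -> lead (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> trail (low) surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a lead/trail surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Endpoints of the range asked about earlier in the thread.
first = to_surrogates(0x20000)
last = to_surrogates(0x2A6D6)
print("U+20000 -> %04X %04X" % first)
print("U+2A6D6 -> %04X %04X" % last)
```

By this calculation the whole range sits under lead surrogates D840
through D869. The last alternative of the rule accepts those pairs along
with every other supplementary character, which is why it pairs the units
correctly but says nothing about the Unicode category of the character.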