[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)
jaluber at gmx.de
Sun Jul 6 06:00:03 PDT 2008
> Gavin Lambert wrote:
>> At 14:41 6/07/2008, Joe wrote:
>> >But what about characters outside the BMP? For example how
>> >would I match the CJK UNIFIED IDEOGRAPH range (U+20000..U+2A6D6)?
>> >Individually splitting them into two 16-bit characters is not
>> >a viable solution.
>> That one, I'm less sure about, and I think the answer depends on your
>> target language.
>> For example, I think Java uses UTF-16, which means that you do indeed
>> have to split them into two 16-bit characters (because that's how it
>> encodes them).
>> Whereas the C target uses UTF-32, so I think you wouldn't need to do
>> that. I'm not sure how to express the character in the grammar,
>> though -- I've never needed to do that.
>> (And you should try to keep responses on the list -- that way other
>> people can chime in if they have a better answer. Use your Reply All
> Yes, UTF-16 characters may be 2x16 bit, but the decoding should not have
> to be in the grammar (which would be extremely impractical). It should
> be taken care of internally. As far as I can tell, this would require
> changing the LA method (which already returns an int) and adding another
> escape sequence for characters in the range from U+10000 to U+10FFFF (I
> may be wrong though).
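To make the "taken care of internally" idea concrete, here is a minimal Java sketch of a lookahead that returns whole code points instead of 16-bit units. The class CodePointLookahead and its methods are hypothetical names made up for illustration; this is not the ANTLR runtime's LA method, only a guess at what such decoding could look like.

```java
// Hypothetical helper for illustration only; not part of the ANTLR runtime.
final class CodePointLookahead {
    static final int EOF = -1;

    // Returns the code point starting at UTF-16 index i, or EOF when i is
    // outside the input.
    static int la(CharSequence input, int i) {
        if (i < 0 || i >= input.length()) {
            return EOF;
        }
        char hi = input.charAt(i);
        if (Character.isHighSurrogate(hi) && i + 1 < input.length()) {
            char lo = input.charAt(i + 1);
            if (Character.isLowSurrogate(lo)) {
                // Combine the pair into one supplementary code point.
                return Character.toCodePoint(hi, lo);
            }
        }
        // BMP character, or an unpaired surrogate passed through unchanged.
        return hi;
    }

    // Number of UTF-16 code units to consume after matching a code point.
    static int width(int codePoint) {
        return Character.charCount(codePoint);
    }
}
```

With something like this, a lexer would advance by width(c) units after each match, and the grammar itself would never see the two halves of a pair.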
I have created a special rule to parse UTF-16 surrogate codepoints for
my own C# lexer:
// This rule is supposed to catch all characters which may be used as
// part of an identifier.
// Note that this rule is a superset which may include not only
// positionally invalid characters, but also always-invalid characters.
// Sort the bad identifiers out in the parser, where the symbol tables
// are built. Digits aren't included because allowing them in the first
// position causes confusion with
// and REAL_LITERAL. The way it is structured is to get the characters
// in the UTF-16 encoding.
// This is a hack to work around the ANTLR 3 limitation that one can't
// choose Unicode character classes directly. Also known characters
// required by other rules are
: 'A'..'Z' // Use only alphabet characters below U+0080
| '\u0080'..'\u009F' // NO NO_BREAK SPACE
| '\u00A1'..'\u167F' // NO OGHAM SPACE MARK
| '\u1681'..'\u180D' // NO MONGOLIAN VOWEL SEPARATOR
| '\u180F'..'\u1FFF' // NO EN QUAD, EM QUAD, EN SPACE, THREE_PER_EM SPACE,
                     // FOUR_PER_EM SPACE, SIX_PER_EM SPACE
| '\u2007' // NO PUNCTUATION SPACE, THIN SPACE, HAIR SPACE
| '\u200B'..'\u202E' // NO NARROW NO_BREAK SPACE
| '\u2030'..'\u205E' // NO MEDIUM MATHEMATICAL SPACE
| '\u2060'..'\u2FFF' // NO IDEOGRAPHIC SPACE
| '\uD800'..'\uDBFF' '\uDC00'..'\uDFFF' // Surrogate code points
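The last alternative accepts any high surrogate followed by any low surrogate. If you want to narrow it to a specific supplementary range such as the one asked about above, the surrogate bounds can be computed rather than worked out by hand. The following is only a sketch in Java; SurrogateRanges and its methods are names invented for this example, and the output simply follows the '\uXXXX' escape style used in the rule above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only: turns a range of supplementary
// code points into alternatives over UTF-16 surrogate pairs.
final class SurrogateRanges {
    // High (leading) surrogate of a supplementary code point.
    static int high(int cp) { return 0xD800 + ((cp - 0x10000) >> 10); }

    // Low (trailing) surrogate of a supplementary code point.
    static int low(int cp) { return 0xDC00 + ((cp - 0x10000) & 0x3FF); }

    // Formats one 16-bit unit as a quoted four-digit hex escape.
    static String esc(int unit) { return String.format("'\\u%04X'", unit); }

    static String range(int a, int b) {
        return a == b ? esc(a) : esc(a) + ".." + esc(b);
    }

    static List<String> alternatives(int first, int last) {
        if (first < 0x10000 || last > 0x10FFFF || first > last) {
            throw new IllegalArgumentException("not a supplementary range");
        }
        int h1 = high(first), l1 = low(first);
        int h2 = high(last), l2 = low(last);
        List<String> alts = new ArrayList<>();
        if (h1 == h2) {
            // Whole range shares one high surrogate.
            alts.add(esc(h1) + " " + range(l1, l2));
            return alts;
        }
        String tail = null;
        if (l1 != 0xDC00) {
            // Partial block at the start of the range.
            alts.add(esc(h1) + " " + range(l1, 0xDFFF));
            h1++;
        }
        if (l2 != 0xDFFF) {
            // Partial block at the end of the range.
            tail = esc(h2) + " " + range(0xDC00, l2);
            h2--;
        }
        if (h1 <= h2) {
            // Complete blocks: every low surrogate is allowed.
            alts.add(range(h1, h2) + " " + range(0xDC00, 0xDFFF));
        }
        if (tail != null) {
            alts.add(tail);
        }
        return alts;
    }
}
```

For U+20000..U+2A6D6, alternatives(0x20000, 0x2A6D6) gives two entries: '\uD840'..'\uD868' '\uDC00'..'\uDFFF' and '\uD869' '\uDC00'..'\uDED6'.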
I've attached the whole file in case you want to look at the other rules.