[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)

Johannes Luber jaluber at gmx.de
Sun Jul 6 06:00:03 PDT 2008


Joe wrote:
> Gavin Lambert wrote:
>> At 14:41 6/07/2008, Joe wrote:
>> >But what about characters outside the BMP? For example how
>> >would I match the CJK UNIFIED IDEOGRAPH range (U+20000..U+2A6D6)?
>> >Individually splitting them into two 16-bit characters is not
>> >a viable solution.
>>
>> That one, I'm less sure about, and I think the answer depends on your 
>> target language.
>>
>> For example, I think Java uses UTF-16, which means that you do indeed 
>> have to split them into two 16-bit characters (because that's how it 
>> encodes them).
>>
>> Whereas the C target uses UTF-32, so I think you wouldn't need to do 
>> that.  I'm not sure how to express the character in the grammar, 
>> though -- I've never needed to do that.
>>
>> (And you should try to keep responses on the list -- that way other 
>> people can chime in if they have a better answer.  Use your Reply All 
>> button.)
>>
> Yes, UTF-16 characters may be 2x16 bit, but the decoding should not have 
> to be in the grammar (which would be extremely impractical). It should 
> be taken care of internally. As far as I can tell, this would require 
> changing the LA method (which already returns an int) and adding another 
> escape sequence for characters in the range from U+10000 to U+10FFFF (I 
> may be wrong though).
> 

I have created a special rule to parse UTF-16 surrogate codepoints for 
my own C# lexer:

// This rule is supposed to catch all characters which may be used as part of
// an identifier. Note that this rule is a superset: it may include not only
// positionally invalid characters but also characters that are always invalid.
// Sort the bad identifiers out in the parser, where the symbol tables are
// built. Digits aren't included because allowing them in first position
// causes confusion with INTEGER_LITERAL and REAL_LITERAL. The rule is
// structured to match characters in their UTF-16 encoding.
//
// This is a hack to work around the ANTLR 3 limitation that one can't choose
// Unicode character classes directly. Also, characters known to be required
// by other rules are excluded.
fragment ANY_UNUSED_CHARACTER
	:	'A'..'Z'	// Use only alphabet characters below U+0080
	|	'a'..'z'
	|	'\u0080'..'\u009F'	// NO NO_BREAK SPACE
	|	'\u00A1'..'\u167F'	// NO OGHAM SPACE MARK
	|	'\u1681'..'\u180D'	// NO MONGOLIAN VOWEL SEPARATOR
	|	'\u180F'..'\u1FFF'	// NO EN QUAD, EM QUAD, EN SPACE, EM SPACE, THREE_PER_EM SPACE, FOUR_PER_EM SPACE, SIX_PER_EM SPACE
	|	'\u2007'		// NO PUNCTUATION SPACE, THIN SPACE, HAIR SPACE
	|	'\u200B'..'\u202E'	// NO NARROW NO_BREAK SPACE
	|	'\u2030'..'\u205E'	// NO MEDIUM MATHEMATICAL SPACE
	|	'\u2060'..'\u2FFF'	// NO IDEOGRAPHIC SPACE
	|	'\u3001'..'\uD7FF'
	|	'\uE000'..'\uFFFE'
	|	'\uD800'..'\uDBFF' '\uDC00'..'\uDFFF' // Surrogate code points
	;
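
The last alternative accepts any high surrogate followed by any low surrogate. To narrow it to one supplementary range, such as the U+20000..U+2A6D6 block from Joe's question, the range can be split into surrogate-pair ranges ahead of time. Below is a minimal sketch in Python, used here only for illustration. The helper names (to_surrogates, surrogate_ranges, as_antlr_alternative) are made up for this example and are not part of ANTLR, and the result assumes a target whose lexer sees UTF-16 code units.

```python
def to_surrogates(cp):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def surrogate_ranges(first, last):
    """Split the inclusive range first..last into (high_lo, high_hi, low_lo, low_hi) pieces."""
    hi_first, lo_first = to_surrogates(first)
    hi_last, lo_last = to_surrogates(last)
    if hi_first == hi_last:
        return [(hi_first, hi_first, lo_first, lo_last)]
    pieces = []
    if lo_first != 0xDC00:
        # The first high surrogate covers only part of the low surrogate range.
        pieces.append((hi_first, hi_first, lo_first, 0xDFFF))
        hi_first += 1
    tail = None
    if lo_last != 0xDFFF:
        # The last high surrogate covers only part of the low surrogate range.
        tail = (hi_last, hi_last, 0xDC00, lo_last)
        hi_last -= 1
    if hi_first <= hi_last:
        # Every high surrogate in between pairs with the full low surrogate range.
        pieces.append((hi_first, hi_last, 0xDC00, 0xDFFF))
    if tail:
        pieces.append(tail)
    return pieces


def as_antlr_alternative(piece):
    """Format one piece as an ANTLR 3 style alternative (hypothetical output format)."""
    h0, h1, l0, l1 = piece

    def rng(a, b):
        return "'\\u%04X'" % a if a == b else "'\\u%04X'..'\\u%04X'" % (a, b)

    return rng(h0, h1) + " " + rng(l0, l1)


if __name__ == "__main__":
    for piece in surrogate_ranges(0x20000, 0x2A6D6):
        print("|\t" + as_antlr_alternative(piece))
```

For U+20000..U+2A6D6 this yields two alternatives, '\uD840'..'\uD868' '\uDC00'..'\uDFFF' and '\uD869' '\uDC00'..'\uDED6', which could stand in for the catch-all surrogate line if only that block should be accepted.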

I've attached the whole file in case you want to look at the other rules.

Johannes
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: CSharp3Lexer.g
Url: http://www.antlr.org/pipermail/antlr-interest/attachments/20080706/e450fd31/attachment.pl 

