[antlr-interest] Unicode support

Mark Lentczner markl at glyphic.com
Thu May 13 08:13:58 PDT 2004


Ric already gave you the outline of where Unicode support sits.  Here 
is my take:

It depends on what you need to do.  If all you need is that things like 
strings or data need to accept the full range of Unicode, you can do 
this fairly easily.  If you also identifiers (variable and function 
names or the like) to accept full Unicode, this is also do able, though 
a little harder.  If literals and other symbols are from the full 
Unicode set, this is still harder.

If you are working in Java, then Antlr works with Unicode in so far as 
Java does: It assumes a 16-bit char, and and only understands the 
Unicode BMP (characters <= U+FFFF).  Characters above U+FFFF will 
appear as surrogate paris, but Java and Antlr will treat them as two 
characters.  This is generally not a problem unless you are trying to 
match those characters explicitly (in literals or as lexer rules), in 
which case you need to hand code them as the pairs - and you may need 
to increase your k value.

The only other down side of this is that if you set your charVocabulary 
to '\u0001'..'\uFFFE' then constructs like ~(' '|'\t') will generate 
large bitset objects.  But then, that's what computers are for...

In C++ the above approach may work, but it is not supported.

A different approach for C++ is to treat your input as UTF-8 encoded 
Unicode and lex over that.  This entails changing your lexer rules 
where they match Unicode characters to instead match the UTF-8 
sequence:

lexing Unicode:
	A_WITH_GRAVE: '\u00C0' | '\u00E0' ; 	// upper and lower case
lexing UTF-8 encoded Unicode:
	A_WITH_GRAVE: '\u00C3' '\u0080' | '\u00C3' '\u00A0' ;
or
	A_WITH_GRAVE: '\u00C3' ( '\u0080' | '\u00A0' ) ;

This works, but can get very complex to generate by hand, especially 
when ranges are involved.  Notice that even though I used '\u' as the 
escape sequence, in the UTF8 code, these are really 8-bit bytes.  The 
character vocabulary for a UTF-8 encoded Unicode lexer is just 
'\u0000'..'\u00FF'.

If you need to go down the UTF-8 path, I have some perl scripts that 
could help, let me know.

	- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/



 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/
 



More information about the antlr-interest mailing list