[antlr-interest] Re: Unicode support

Mark Lentczner markl at glyphic.com
Fri May 21 08:19:32 PDT 2004


On May 21, 2004, at 6:44 AM, meilland78 wrote:
> Yes my only requirement is to parse "<asian characters here>".
> The language I parse has strings which can contain unicode
> characters. But the language itself doesn't need to be in unicode.

Okay, easy as pie!

Step one: Make sure your input is fed to the lexer as UTF-8 encoded 
bytes, not Unicode characters.  This shouldn't be hard in either Java 
or C++.
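
A minimal sketch of step one in Java, assuming the source text starts out as a Java String. The class and method names are made up for illustration, and MyLexer stands in for whatever your generated lexer is called. The idea is to encode to UTF-8 and then expose each byte as one char in the range 0..255, which is the range the charVocabulary in step two covers.

```java
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

class Utf8ByteFeed {
    // Encode the text as UTF-8, then map every byte to one char (0..255).
    // ISO-8859-1 is used only because it maps bytes to chars one-to-one.
    static String bytesAsChars(String source) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reader over the byte-valued chars, suitable for handing to a lexer.
    static Reader lexerInput(String source) {
        return new StringReader(bytesAsChars(source));
    }

    public static void main(String[] args) {
        // A quoted string holding the single character U+65E5.
        String fed = bytesAsChars("\"\u65E5\"");
        for (char c : fed.toCharArray()) {
            System.out.printf("%02X ", (int) c);
        }
        System.out.println();
        // Hypothetical use: new MyLexer(lexerInput(sourceText))
    }
}
```

If the input is already a file of UTF-8 bytes, wrapping the InputStream in an InputStreamReader with StandardCharsets.ISO_8859_1 gives the same one-char-per-byte view without the intermediate String.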

Step two: Add this to your lexer options:

     options {
         charVocabulary = '\u0000'..'\u00ff';
     }

Step three: Add this to your lexer rules:

     STRING: '"'! ( options{greedy=false;}: UTF8_CHAR )* '"'! ;

     protected UTF8_CHAR:
           '\u0000'..'\u007F'
         | '\u00C2'..'\u00DF' UTF8_EXT_80_BF
         | '\u00E0'           UTF8_EXT_A0_BF UTF8_EXT_80_BF
         | '\u00E1'..'\u00EC' UTF8_EXT_80_BF UTF8_EXT_80_BF
         // 0xED is split out so that encoded surrogates (U+D800..U+DFFF) are rejected
         | '\u00ED'           UTF8_EXT_80_9F UTF8_EXT_80_BF
         | '\u00EE'..'\u00EF' UTF8_EXT_80_BF UTF8_EXT_80_BF
         | '\u00F0'           UTF8_EXT_90_BF UTF8_EXT_80_BF UTF8_EXT_80_BF
         | '\u00F1'..'\u00F3' UTF8_EXT_80_BF UTF8_EXT_80_BF UTF8_EXT_80_BF
         | '\u00F4'           UTF8_EXT_80_8F UTF8_EXT_80_BF UTF8_EXT_80_BF
         ;

     protected UTF8_EXT_80_BF: '\u0080'..'\u00BF' ;
     protected UTF8_EXT_80_8F: '\u0080'..'\u008F' ;
     protected UTF8_EXT_80_9F: '\u0080'..'\u009F' ;
     protected UTF8_EXT_90_BF: '\u0090'..'\u00BF' ;
     protected UTF8_EXT_A0_BF: '\u00A0'..'\u00BF' ;

This will accept any Unicode character, legally encoded in UTF-8.  Note 
that the '\uXXXX' notation is being used here to specify only 8-bit 
byte values, not actual Unicode characters.
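
To make the byte-level view concrete, here is a small Java sketch that mirrors the lead-byte ranges of the UTF8_CHAR rule and reports how many bytes each alternative consumes. The class and method names are invented for this example, and it only classifies the first byte; it does not check the continuation bytes.

```java
class Utf8LeadByte {
    // Returns the total sequence length implied by a lead byte (0..255),
    // following the lead-byte ranges used in UTF8_CHAR, or -1 if the byte
    // cannot start a legal UTF-8 sequence.
    static int sequenceLength(int leadByte) {
        if (leadByte >= 0x00 && leadByte <= 0x7F) return 1; // plain ASCII
        if (leadByte >= 0xC2 && leadByte <= 0xDF) return 2; // lead + 1 extension byte
        if (leadByte >= 0xE0 && leadByte <= 0xEF) return 3; // lead + 2 extension bytes
        if (leadByte >= 0xF0 && leadByte <= 0xF4) return 4; // lead + 3 extension bytes
        return -1; // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xC3, 0xE6, 0xF0, 0x80, 0xC0, 0xF5};
        for (int b : samples) {
            System.out.printf("%02X -> %d%n", b, sequenceLength(b));
        }
    }
}
```

For instance, U+65E5 is encoded as the three bytes E6 97 A5, so the lexer sees '\u00E6' followed by two extension bytes rather than a single '\u65E5'.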

Remember that the text of the STRING tokens will be UTF-8 encoded.  You 
could decode this back into Unicode strings either in the STRING rule 
itself, or later in your parser or tree walker(s) as needed.
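
One possible way to do that decoding in Java, assuming the token text holds one char per UTF-8 byte as described above. The helper name is hypothetical; it could be called from an action in the STRING rule or from later code that handles the token.

```java
import java.nio.charset.StandardCharsets;

class Utf8TokenText {
    // tokenText is expected to contain only chars in 0..255, one per UTF-8 byte.
    // ISO-8859-1 turns each such char back into the byte it stands for; the
    // bytes are then decoded as UTF-8. Chars above 255 would be replaced by '?',
    // and malformed UTF-8 decodes to U+FFFD replacement characters.
    static String decode(String tokenText) {
        byte[] raw = tokenText.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Token text for the bytes E6 97 A5, which encode U+65E5.
        String tokenText = "\u00E6\u0097\u00A5";
        String decoded = decode(tokenText);
        System.out.println(decoded.length());                  // one UTF-16 code unit
        System.out.println(Integer.toHexString(decoded.charAt(0)));
    }
}
```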

	- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/



 