[antlr-interest] Re: Unicode support
Mark Lentczner
markl at glyphic.com
Fri May 21 08:19:32 PDT 2004
On May 21, 2004, at 6:44 AM, meilland78 wrote:
> Yes my only requirement is to parse "<asian characters here>".
> The language I parse has strings which can contain unicode
> characters. But the language itself doesn't need to be in unicode.
Okay, easy as pie!
Step one: Make sure your input is fed to the lexer as UTF-8 encoded
bytes, not Unicode characters. This shouldn't be hard in either Java
or C++.
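In Java, one way to do step one is to decode the stream as ISO-8859-1, which maps every byte 0x00..0xFF to the char with the same value, so the lexer sees the raw UTF-8 bytes rather than decoded characters. This is only a minimal sketch: the class and method names are made up for illustration, and handing the resulting Reader to your generated lexer is an assumption about your setup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ANTLR.
final class ByteInput {
    private ByteInput() {}

    // Wraps a UTF-8 byte stream so that each byte arrives as one char
    // in the range 0x00..0xFF (ISO-8859-1 is a 1:1 byte-to-char mapping).
    static Reader bytesAsChars(InputStream utf8Bytes) {
        return new InputStreamReader(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Drains the reader into a String; only here to show what a lexer
    // reading from bytesAsChars() would see.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

A quoted string holding a single three-byte character then shows up as five chars: the opening quote, three chars in the 0x80..0xFF range, and the closing quote.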
Step two: Add this to your lexer options:
options {
    charVocabulary = '\u0000'..'\u00ff';
}
Step three: Add this to your lexer rules:
STRING: '"'! ( options{greedy=false;}: UTF8_CHAR )* '"'! ;
protected UTF8_CHAR:
      '\u0000'..'\u007F'
    | '\u00C2'..'\u00DF' UTF8_EXT_80_BF
    | '\u00E0'           UTF8_EXT_A0_BF UTF8_EXT_80_BF
    | '\u00E1'..'\u00EC' UTF8_EXT_80_BF UTF8_EXT_80_BF
    // 0xED is limited to 0x80..0x9F so that encoded surrogates
    // (U+D800..U+DFFF), which are not legal UTF-8, are rejected
    | '\u00ED'           UTF8_EXT_80_9F UTF8_EXT_80_BF
    | '\u00EE'..'\u00EF' UTF8_EXT_80_BF UTF8_EXT_80_BF
    | '\u00F0'           UTF8_EXT_90_BF UTF8_EXT_80_BF UTF8_EXT_80_BF
    | '\u00F1'..'\u00F3' UTF8_EXT_80_BF UTF8_EXT_80_BF UTF8_EXT_80_BF
    | '\u00F4'           UTF8_EXT_80_8F UTF8_EXT_80_BF UTF8_EXT_80_BF
    ;
protected UTF8_EXT_80_BF: '\u0080'..'\u00BF' ;
protected UTF8_EXT_80_8F: '\u0080'..'\u008F' ;
protected UTF8_EXT_80_9F: '\u0080'..'\u009F' ;
protected UTF8_EXT_90_BF: '\u0090'..'\u00BF' ;
protected UTF8_EXT_A0_BF: '\u00A0'..'\u00BF' ;
This will accept any Unicode character, legally encoded in UTF-8. Note
that the '\uXXXX' notation is being used here to specify only 8-bit
byte values, not actual Unicode characters.
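If it helps to see the same byte ranges outside the grammar, here is a rough Java sketch of a checker that follows the well-formed UTF-8 ranges the UTF8_CHAR rule is built on. The class and method names are hypothetical. Note that it also rejects encoded surrogates (lead byte 0xED followed by 0xA0..0xBF), which well-formed UTF-8 excludes.

```java
// Hypothetical helper, written only to illustrate the byte ranges.
final class Utf8Ranges {
    private Utf8Ranges() {}

    // Returns the length in bytes (1..4) of the well-formed UTF-8 sequence
    // starting at b[i], or 0 if no well-formed sequence starts there.
    // Assumes 0 <= i < b.length.
    static int sequenceLength(byte[] b, int i) {
        int lead = b[i] & 0xFF;
        if (lead <= 0x7F) return 1;
        if (lead >= 0xC2 && lead <= 0xDF) return tail(b, i, 1, 0x80, 0xBF);
        if (lead == 0xE0) return tail(b, i, 2, 0xA0, 0xBF); // no overlong forms
        if (lead == 0xED) return tail(b, i, 2, 0x80, 0x9F); // no surrogates
        if (lead >= 0xE1 && lead <= 0xEF) return tail(b, i, 2, 0x80, 0xBF);
        if (lead == 0xF0) return tail(b, i, 3, 0x90, 0xBF); // no overlong forms
        if (lead >= 0xF1 && lead <= 0xF3) return tail(b, i, 3, 0x80, 0xBF);
        if (lead == 0xF4) return tail(b, i, 3, 0x80, 0x8F); // stop at U+10FFFF
        return 0; // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }

    // Checks the n bytes following b[i]: the first must lie in lo..hi,
    // the remaining ones in 0x80..0xBF. Returns n + 1 on success, else 0.
    private static int tail(byte[] b, int i, int n, int lo, int hi) {
        if (i + n >= b.length) return 0; // sequence is cut off
        int first = b[i + 1] & 0xFF;
        if (first < lo || first > hi) return 0;
        for (int k = 2; k <= n; k++) {
            int c = b[i + k] & 0xFF;
            if (c < 0x80 || c > 0xBF) return 0;
        }
        return n + 1;
    }
}
```

Each branch corresponds to one alternative of the character rule: the lead byte picks the alternative, and the first following byte is the only one whose range ever narrows.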
Remember that the text of the STRING tokens will be UTF-8 encoded. You
could decode this back into Unicode strings either in the STRING rule
itself, or later in your parser or tree walker(s) as needed.
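For the decoding, a minimal Java sketch could look like the following. It assumes the token text reaches you as a String in which every char is really one UTF-8 byte (a value in 0x00..0xFF); the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ANTLR.
final class TokenText {
    private TokenText() {}

    // Turns text whose chars are really UTF-8 bytes into a proper Unicode
    // string: ISO-8859-1 maps each char 0x00..0xFF back to the same byte
    // value, and those bytes are then decoded as UTF-8.
    static String decodeUtf8(String byteChars) {
        byte[] raw = byteChars.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

You would call something like TokenText.decodeUtf8(token.getText()) wherever you need the real string, with token standing for whatever token object your parser hands you. Malformed input is not reported by this sketch; the String constructor silently substitutes U+FFFD, which is one more reason to let the lexer rule reject bad sequences first.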
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/