[antlr-interest] Re: Unicode support

Fri May 21 08:34:21 PDT 2004

Thanks a lot, Mark.

Easy as pie? I'll confirm that only after I manage to get it 
working ;)

The language I parse contains strings with Asian characters ("<asian 
characters here>"), but the language itself will never be written in 
an Asian language. However, the input I provide to the lexer is 
entirely in Unicode, so I guess I'll have to find a way to convert it 
to UTF-8 and then convert the result back to Unicode.
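
For the first half of that conversion, here is a minimal Java sketch of how I imagine it could look, assuming the Unicode input is available as a Java String. The class and method names (Utf8Input, encodeUtf8, toUtf8Stream) are made up for illustration; the resulting stream would then be handed to the generated lexer, which I have not tried yet.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class Utf8Input {
    // Encode Unicode text as UTF-8 bytes.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Expose the UTF-8 bytes as a stream, so a byte-oriented lexer
    // sees one value in 0..255 per read() call.
    static InputStream toUtf8Stream(String text) {
        return new ByteArrayInputStream(encodeUtf8(text));
    }

    public static void main(String[] args) throws IOException {
        // "\u65e5\u672c" is two CJK characters, three UTF-8 bytes each.
        InputStream in = toUtf8Stream("x = \"\u65e5\u672c\";");
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        System.out.println(count); // 7 ASCII bytes + 2 * 3 bytes = 13
    }
}
```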

I'll keep you posted on how things go for me.

Thanks again for your help.

Cheers,

J.Claude.

--- In antlr-interest at yahoogroups.com, Mark Lentczner <markl at g...> 
wrote:
> 
> On May 21, 2004, at 6:44 AM, meilland78 wrote:
> > Yes, my only requirement is to parse "<asian characters here>".
> > The language I parse has strings which can contain Unicode
> > characters. But the language itself doesn't need to be in Unicode.
> 
> Okay, easy as pie!
> 
> Step one: Make sure your input is fed to the lexer as UTF-8 encoded 
> bytes, not Unicode characters.  This shouldn't be hard in either 
> Java or C++.
> 
> Step two: Add this to your lexer options:
> 
>      options {
>          charVocabulary = '\u0000'..'\u00ff';
>      }
> 
> Step three: Add this to your lexer rules:
> 
>      STRING: '"'! ( options{greedy=false;}: UTF8_CHAR )* '"'! ;
> 
>      protected UTF8_CHAR:
>            '\u0000'..'\u007F'
>          | '\u00C2'..'\u00DF' UTF8_EXT_80_BF
>          | '\u00E0'           UTF8_EXT_A0_BF UTF8_EXT_80_BF
>          | '\u00E1'..'\u00EF' UTF8_EXT_80_BF UTF8_EXT_80_BF
>          | '\u00F0'           UTF8_EXT_90_BF UTF8_EXT_80_BF UTF8_EXT_80_BF
>          | '\u00F1'..'\u00F3' UTF8_EXT_80_BF UTF8_EXT_80_BF UTF8_EXT_80_BF
>          | '\u00F4'           UTF8_EXT_80_8F UTF8_EXT_80_BF UTF8_EXT_80_BF
>          ;
> 
>      protected UTF8_EXT_80_BF: '\u0080'..'\u00BF' ;
>      protected UTF8_EXT_80_8F: '\u0080'..'\u008F' ;
>      protected UTF8_EXT_90_BF: '\u0090'..'\u00BF' ;
>      protected UTF8_EXT_A0_BF: '\u00A0'..'\u00BF' ;
> 
> This will accept any Unicode character, legally encoded in UTF-8.  
> Note that the '\uXXXX' notation is being used here to specify only 
> 8-bit byte values, not actual Unicode characters.
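
To check my reading of the UTF8_CHAR rule, here is a rough Java sketch that mirrors its alternatives on a plain byte array. It is not generated by ANTLR, and the names (Utf8CharMatcher, matchLength) are hypothetical. Like the rule as posted, it treats lead bytes E1..EF uniformly, so it does not single out the ED A0..BF sequences that would encode surrogates.

```java
final class Utf8CharMatcher {
    // True if b (0..255) lies in lo..hi inclusive.
    private static boolean inRange(int b, int lo, int hi) {
        return b >= lo && b <= hi;
    }

    // Returns the length in bytes (1..4) of the sequence starting at
    // buf[pos] if it matches one alternative of UTF8_CHAR, or 0 otherwise.
    static int matchLength(byte[] buf, int pos) {
        if (pos >= buf.length) return 0;
        int b0 = buf[pos] & 0xFF;
        int len;
        int lo = 0x80, hi = 0xBF; // range allowed for the second byte
        if (b0 <= 0x7F) return 1;
        else if (inRange(b0, 0xC2, 0xDF)) len = 2;
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // UTF8_EXT_A0_BF
        else if (inRange(b0, 0xE1, 0xEF)) len = 3;
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // UTF8_EXT_90_BF
        else if (inRange(b0, 0xF1, 0xF3)) len = 4;
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // UTF8_EXT_80_8F
        else return 0;                                  // 80..C1 and F5..FF
        if (pos + len > buf.length) return 0;           // truncated sequence
        if (!inRange(buf[pos + 1] & 0xFF, lo, hi)) return 0;
        for (int i = 2; i < len; i++) {
            if (!inRange(buf[pos + i] & 0xFF, 0x80, 0xBF)) return 0;
        }
        return len;
    }

    public static void main(String[] args) {
        byte[] ok = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5}; // U+65E5
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};        // 0xC0 is not a valid lead byte
        System.out.println(matchLength(ok, 0));       // 3
        System.out.println(matchLength(overlong, 0)); // 0
    }
}
```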
> 
> Remember that the text of the STRING tokens will be UTF-8 encoded.  
> You could decode this back into Unicode strings either in the STRING 
> rule itself, or later in your parser or tree walker(s) as needed.
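
The decoding step might look roughly like the Java sketch below. It assumes the token text arrives as a String in which each char carries exactly one UTF-8 byte (a value in 0..255), which is what I would expect when the lexer is fed raw bytes. The names Utf8TokenText and decode are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8TokenText {
    // Assumption: every char of tokenText holds one byte value (0..255).
    // ISO-8859-1 maps chars 0..255 to the same byte values, so it recovers
    // the raw bytes, which are then decoded as UTF-8.
    static String decode(String tokenText) {
        byte[] raw = tokenText.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes E6 97 A5 carried as three chars; they encode U+65E5.
        String tokenText = "\u00e6\u0097\u00a5";
        System.out.println(decode(tokenText).equals("\u65e5")); // true
    }
}
```

Doing this once per STRING token, rather than on the whole input, keeps the rest of the lexer working on plain 8-bit values.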
> 
> 	- Mark
> 
> Mark Lentczner
> markl at w...
> http://www.wheatfarm.org/
