[antlr-interest] Unicode handling

Monty Zukowski monty at codetransform.com
Wed Apr 21 22:32:20 PDT 2004


Don't forget that you don't have to use ANTLR's lexer.  You can easily 
hook up another lexer to an ANTLR parser -- to the parser, a lexer is 
just an object with a nextToken() method.  I don't know which good 
Unicode lexers targeting C++ are out there, but chances are good that 
some are better than ANTLR's.  Ter has some significant 
improvements coming in ANTLR 3 for lexer generation.  ANTLR lexers 
aren't known to be very fast either (yet).
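The contract the parser needs is small.  Here's a minimal C++ sketch of the idea -- note that Token and MyUtf8Lexer are illustrative names I'm making up here, not ANTLR's actual C++ classes:

```cpp
#include <cassert>
#include <string>

// Hypothetical token shape; the parser only cares about matching on a
// type code and (sometimes) reading the text.
struct Token {
    int type;          // token type code the parser matches on
    std::string text;  // matched text
};

// A hand-written lexer: any class exposing nextToken() can feed a parser.
class MyUtf8Lexer {
public:
    explicit MyUtf8Lexer(std::string input) : input_(std::move(input)) {}

    // The only method the parser actually needs.
    Token nextToken() {
        // Real lexing logic goes here; this stub emits one token per
        // byte and then EOF (type 1 chosen arbitrarily for this sketch).
        if (pos_ >= input_.size()) return Token{1, "<EOF>"};
        char c = input_[pos_++];
        return Token{2, std::string(1, c)};
    }

private:
    std::string input_;
    std::size_t pos_ = 0;
};
```

The real work is in the body of nextToken(); the point is just that swapping the lexer out from under the parser doesn't require touching the grammar.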

Monty Zukowski

ANTLR & Java Consultant -- http://www.codetransform.com
ANSI C/GCC transformation toolkit -- 
http://www.codetransform.com/gcc.html
Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html


On Apr 21, 2004, at 3:08 PM, Mark Lentczner wrote:

> My project's source files are Unicode, and we are using Antlr to
> generate the lexer, parser and compiler in C++.
>
> It seems from the docs that Antlr isn't really ready to deal with the full
> complement of Unicode characters.  I found references to problems with
> EOF (integer -1, typecast to 0xFFFF as a character), problems with
> character sets (getting very large), and it seems to assume that
> Unicode characters are only 16 bits (which is no longer true).
>
> So, rather than try to work around or fix these problems, I intend to
> make my tool chain work with UTF-8 encoded source.  (This is especially
> easy for us, since the process feeding the source stream already
> normalizes the incoming character set to UTF-8.)
>
> Instead of parsing Unicode:
>
> NAME_START_CHAR:
>      ':' | 'A'-'Z' | '_' | 'a'-'z'
>      | '\u00C0'-'\u00D6'
>      | '\u00D8'-'\u00F6'
>      | '\u00F8'-'\u02FF'
>      | '\u0370'-'\u037D'
>      | '\u037F'-'\u1FFF'
>      | '\u200C'-'\u200D'
>      | '\u2070'-'\u218F'
>      | '\u2C00'-'\u2FEF'
>      | '\u3001'-'\uD7FF'
>      | '\uF900'-'\uFDCF'
>      | '\uFDF0'-'\uFFFD'
>      | '\u10000'-'\uEFFFF'   // won't work in Antlr as it can't handle these Unicode chars
>      ;
>
> We'd be parsing the UTF-8 encoded version of these characters:
>
> NAME_START_CHAR:
>      ':' | 'A'..'Z' | '_' | 'a'..'z'
>      | '\u00C3' '\u0080'..'\u0096'           // '\u00C0'-'\u00D6'
>      | '\u00C3' '\u0098'..'\u00B6'           // '\u00D8'-'\u00F6'
>      | '\u00C3' '\u00B8'..'\u00BF'           // '\u00F8'-'\u00FF'
>      | '\u00C4'..'\u00CB' '\u0080'..'\u00BF' // '\u0100'-'\u02FF'
>      | '\u00CD' '\u00B0'..'\u00BD'           // '\u0370'-'\u037D'
>      | '\u00CD' '\u00BF'                     // '\u037F'
>      | '\u00CE'..'\u00DF' '\u0080'..'\u00BF' // '\u0380'-'\u07FF'
>      | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF'    // '\u0800'-'\u0FFF'
>      | '\u00E1' '\u0090'..'\u00BF' '\u0080'..'\u00BF'    // '\u1000'-'\u1FFF'
>      ... and so on ...
>      ;
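Those byte ranges fall out mechanically from the UTF-8 encoding rules, so you can generate (and double-check) them rather than hand-compute each one.  A C++ sketch of the standard encoding -- utf8_encode is an illustrative helper of mine, not anything from ANTLR:

```cpp
#include <cstdint>
#include <string>

// Encode a Unicode code point (up to U+10FFFF) as a UTF-8 byte string,
// following the standard 1- to 4-byte patterns from RFC 3629.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                    // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                            // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                          // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                            // 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, utf8_encode(0x0370) yields the bytes 0xCD 0xB0, which is exactly where the '\u00CD' '\u00B0'..'\u00BD' alternative above comes from.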
>
> Does anyone see any pitfalls to this other than increasing the
> lookahead for the lexer?  Since all the meaningful punctuation in our
> source language is in the visible US-ASCII range, the only place the
> difference between parsing Unicode characters vs. UTF-8 encoded bytes
> would show up is in things like the NAME token production.
>
> This seems far preferable to me to extending the C++ support
> with some Unicode library (like IBM's ICU or some such).
>
> - Mark
>
>
> Mark Lentczner
> markl at wheatfarm.org
> http://www.wheatfarm.org/



 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/
 
