[antlr-interest] Unicode handling
Monty Zukowski
monty at codetransform.com
Wed Apr 21 22:32:20 PDT 2004
Don't forget that you don't have to use ANTLR's lexer. You can easily
hook up another lexer to an ANTLR parser -- to the parser a lexer is
just an object with a nextToken() method. I have no idea what good
Unicode lexers are out there targeting C++, but chances are good that
there are some better than ANTLR's. Ter has some significant
improvements coming in ANTLR 3 for lexer generation. ANTLR lexers
aren't known to be very fast either (yet).
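
To make the "object with a nextToken() method" idea concrete, here is a minimal sketch of a hand-written lexer over UTF-8 input. Everything in it is a stand-in invented for illustration: Token, TokenKind and Utf8NameLexer are hypothetical types, not ANTLR's own classes, and the rule that every non-ASCII byte belongs to a name is a deliberate simplification. Hooking something like this up to a real ANTLR parser would mean adapting it to whatever token stream interface the ANTLR runtime expects, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical token types for this sketch (not ANTLR's classes).
enum class TokenKind { Name, Punct, End };

struct Token {
    TokenKind kind;
    std::string text;  // raw UTF-8 bytes of the token
};

// Hypothetical hand-written lexer: the parser only ever calls nextToken().
class Utf8NameLexer {
public:
    explicit Utf8NameLexer(std::string input) : input_(std::move(input)) {}

    Token nextToken() {
        // Skip ASCII whitespace between tokens.
        while (pos_ < input_.size() && is_space(byte_at(pos_))) ++pos_;
        if (pos_ >= input_.size()) return {TokenKind::End, ""};

        const std::size_t start = pos_;
        if (is_name_byte(byte_at(pos_))) {
            // Consume a run of name bytes, including multi-byte UTF-8 sequences.
            while (pos_ < input_.size() && is_name_byte(byte_at(pos_))) ++pos_;
            return {TokenKind::Name, input_.substr(start, pos_ - start)};
        }
        // Anything else is treated as single-byte ASCII punctuation.
        ++pos_;
        return {TokenKind::Punct, input_.substr(start, 1)};
    }

private:
    unsigned char byte_at(std::size_t i) const {
        return static_cast<unsigned char>(input_[i]);
    }
    static bool is_space(unsigned char b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }
    // Assumption: ASCII letters, '_', ':' and ANY byte >= 0x80 count as
    // name bytes. A real lexer would check the decoded code point instead.
    static bool is_name_byte(unsigned char b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
               b == '_' || b == ':' || b >= 0x80;
    }

    std::string input_;
    std::size_t pos_ = 0;
};
```

Calling nextToken() repeatedly on the UTF-8 bytes of "café = x" would yield a Name, a Punct and another Name, followed by End.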
Monty Zukowski
ANTLR & Java Consultant -- http://www.codetransform.com
ANSI C/GCC transformation toolkit -- http://www.codetransform.com/gcc.html
Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html
On Apr 21, 2004, at 3:08 PM, Mark Lentczner wrote:
> My project's source files are Unicode, and we are using Antlr to
> generate the lexer, parser and compiler in C++.
>
> It seems from the docs that Antlr isn't really ready to deal with the full
> complement of Unicode characters. I found references to problems with
> EOF (integer -1, typecast to 0xFFFF as a character), problems with
> character sets (getting very large), and it seems that it assumes that
> Unicode characters are only 16 bits (which is no longer true).
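
The two 16-bit problems mentioned above can be illustrated with a small sketch. Char16 is a hypothetical stand-in for a 16-bit lexer character type, and both helper names are made up for this example. None of this is taken from ANTLR's source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-bit character type standing in for a lexer's char type.
using Char16 = std::uint16_t;

// Narrow an int (such as the EOF marker -1) to 16 bits.
// Conversion to an unsigned type is modular, so -1 becomes 0xFFFF.
inline Char16 narrow_to_char16(int value) {
    return static_cast<Char16>(value);
}

// True if a Unicode code point fits in a single 16-bit unit.
inline bool fits_in_16_bits(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFFu);  // upper limit of the Unicode code space
    return code_point <= 0xFFFFu;
}
```

Here narrow_to_char16(-1) and narrow_to_char16(0xFFFF) compare equal, so an EOF marker stored this way cannot be told apart from the 16-bit value 0xFFFF. Similarly, fits_in_16_bits(0x10000) is false, so characters at U+10000 and above cannot be held in one 16-bit unit.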
>
> So, rather than try to work around or fix these problems, I intend to
> make my tool chain work with UTF-8 encoded source. (This is especially
> easy for us, since the process feeding the source stream already
> normalizes the incoming character set to UTF-8.)
>
> Instead of parsing Unicode:
>
> NAME_START_CHAR:
> ':' | 'A'-'Z' | '_' | 'a'-'z'
> | '\u00C0'-'\u00D6'
> | '\u00D8'-'\u00F6'
> | '\u00F8'-'\u02FF'
> | '\u0370'-'\u037D'
> | '\u037F'-'\u1FFF'
> | '\u200C'-'\u200D'
> | '\u2070'-'\u218F'
> | '\u2C00'-'\u2FEF'
> | '\u3001'-'\uD7FF'
> | '\uF900'-'\uFDCF'
> | '\uFDF0'-'\uFFFD'
> | '\u10000'-'\uEFFFF' // won't work in Antlr as it can't handle these Unicode chars
> ;
>
> We'd be parsing the UTF-8 encoded version of these characters:
>
> NAME_START_CHAR:
> ':' | 'A'..'Z' | '_' | 'a'..'z'
> | '\u00C3' '\u0080'..'\u0096' // '\u00C0'-'\u00D6'
> | '\u00C3' '\u0098'..'\u00B6' // '\u00D8'-'\u00F6'
> | '\u00C3' '\u00B8'..'\u00BF' // '\u00F8'-'\u00FF'
> | '\u00C4'..'\u00CB' '\u0080'..'\u00BF' // '\u0100'-'\u02FF'
> | '\u00CD' '\u00B0'..'\u00BD' // '\u0370'-'\u037D'
> | '\u00CD' '\u00BF' // '\u037F'
> | '\u00CE'..'\u00DF' '\u0080'..'\u00BF' // '\u0380'-'\u07FF'
> | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF' // '\u0800'-'\u0FFF'
> | '\u00E1' '\u0080'..'\u00BF' '\u0080'..'\u00BF' // '\u1000'-'\u1FFF'
> ... and so on ...
> ;
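
Byte ranges like the ones in this rule are easy to get wrong by hand, so it may help to compute the UTF-8 bytes for each range endpoint. The sketch below follows the standard UTF-8 bit layout. The function names encode_utf8 and to_hex are made up for this example and do not come from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point (up to U+10FFFF) as UTF-8 bytes.
// Surrogate code points are not rejected in this sketch.
inline std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFFu);
    std::string out;
    if (cp < 0x80u) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800u) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0u | (cp >> 6));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    } else if (cp < 0x10000u) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0u | (cp >> 12));
        out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    } else {                                // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0u | (cp >> 18));
        out += static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
        out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    }
    return out;
}

// Render bytes as space-separated uppercase hex, e.g. "C3 80".
inline std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

For example, to_hex(encode_utf8(0x00C0)) gives "C3 80" and to_hex(encode_utf8(0x00D6)) gives "C3 96", which matches the first non-ASCII alternative in the rule. Running both endpoints of every remaining range through the same helper gives the lead byte and continuation-byte bounds for each alternative.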
>
> Does anyone see any pitfalls to this other than increasing the lookahead
> for the lexer? Since in our source language all the meaningful
> punctuation is in the visible US-ASCII range, the only place where the
> difference between parsing Unicode characters and UTF-8 encoded Unicode
> characters would matter is in things like the NAME token production.
>
> This seems far preferable to me to extending the C++ support
> with some Unicode library (like IBM's ICU or some such).
>
> - Mark
>
>
> Mark Lentczner
> markl at wheatfarm.org
> http://www.wheatfarm.org/