[antlr-interest] Unicode handling
Mark Lentczner
markl at glyphic.com
Wed Apr 21 15:08:41 PDT 2004
My project's source files are Unicode, and we are using Antlr to
generate the lexer, parser and compiler in C++.
It seems from the documentation that Antlr isn't really ready to deal
with the full complement of Unicode characters. I found references to
problems with EOF (integer -1, typecast to 0xFFFF as a character),
problems with character sets (getting very large), and it seems that it
assumes that Unicode characters are only 16 bits (which is no longer true).
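The EOF and 16-bit problems can be illustrated outside of Antlr. The following is a minimal Python sketch (Python is used here only for illustration and is not part of the tool chain; the variable names are made up for this example). It shows that -1 truncated to 16 bits collides with U+FFFF, and that the first code point above U+FFFF does not fit in one 16-bit unit.

```python
# Illustration only: why a 16-bit character type is not enough.

EOF = -1

# Truncating -1 to 16 bits yields 0xFFFF, which is indistinguishable
# from the code point U+FFFF once it is stored as a character.
eof_as_char16 = EOF & 0xFFFF

# U+10000 is the first code point outside the 16-bit range. In UTF-16 it
# takes two 16-bit code units (a surrogate pair); in UTF-8 it takes four bytes.
first_supplementary = chr(0x10000)
utf16_units = len(first_supplementary.encode("utf-16-be")) // 2
utf8_bytes = first_supplementary.encode("utf-8")

print(hex(eof_as_char16))   # 0xffff
print(utf16_units)          # 2
print(utf8_bytes.hex())     # f0908080
```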
So, rather than try to work around or fix these problems, I intend to
make my tool chain work with UTF-8 encoded source. (This is especially
easy for us, since the process feeding the source stream already
normalizes the incoming character set to UTF-8.)
Instead of parsing Unicode:
NAME_START_CHAR:
    ':' | 'A'..'Z' | '_' | 'a'..'z'
    | '\u00C0'..'\u00D6'
    | '\u00D8'..'\u00F6'
    | '\u00F8'..'\u02FF'
    | '\u0370'..'\u037D'
    | '\u037F'..'\u1FFF'
    | '\u200C'..'\u200D'
    | '\u2070'..'\u218F'
    | '\u2C00'..'\u2FEF'
    | '\u3001'..'\uD7FF'
    | '\uF900'..'\uFDCF'
    | '\uFDF0'..'\uFFFD'
    | '\u10000'..'\uEFFFF' // won't work in Antlr as it can't handle these Unicode chars
    ;
We'd be parsing the UTF-8 encoded version of these characters:
NAME_START_CHAR:
    ':' | 'A'..'Z' | '_' | 'a'..'z'
    | '\u00C3' '\u0080'..'\u0096'                    // '\u00C0'-'\u00D6'
    | '\u00C3' '\u0098'..'\u00B6'                    // '\u00D8'-'\u00F6'
    | '\u00C3' '\u00B8'..'\u00BF'                    // '\u00F8'-'\u00FF'
    | '\u00C4'..'\u00CB' '\u0080'..'\u00BF'          // '\u0100'-'\u02FF'
    | '\u00CD' '\u00B0'..'\u00BD'                    // '\u0370'-'\u037D'
    | '\u00CD' '\u00BF'                              // '\u037F'
    | '\u00CE'..'\u00DF' '\u0080'..'\u00BF'          // '\u0380'-'\u07FF'
    | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF' // '\u0800'-'\u0FFF'
    | '\u00E1' '\u0080'..'\u00BF' '\u0080'..'\u00BF' // '\u1000'-'\u1FFF'
    ... and so on ...
    ;
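Writing these byte ranges by hand is error-prone, so they could be generated instead. The following is a minimal sketch in Python, not part of Antlr or of our tool chain; the function names `encode`, `utf8_sequences` and `antlr_alternative` are hypothetical and chosen for this example. It assumes the input range does not include the surrogates U+D800..U+DFFF, which holds for all ranges in the rule above.

```python
# Sketch: split a code point range into UTF-8 byte-range sequences.
# Assumption: the range does not include surrogates (U+D800..U+DFFF).

def encode(cp):
    """Return the UTF-8 bytes of code point cp as a list of ints."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]

def utf8_sequences(lo, hi):
    """Yield lists of (first_byte, last_byte) pairs covering lo..hi."""
    # Split wherever the encoded length changes (1, 2, 3 or 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            yield from utf8_sequences(lo, limit)
            yield from utf8_sequences(limit + 1, hi)
            return
    # A pure ASCII range is a single one-byte range.
    if hi < 0x80:
        yield [(lo, hi)]
        return
    # Split until, for each group of trailing bits, lo and hi either share
    # the higher bits or the trailing bits run from all zeros to all ones.
    for i in (1, 2, 3):
        m = (1 << (6 * i)) - 1
        if (lo & ~m) != (hi & ~m):
            if (lo & m) != 0:
                yield from utf8_sequences(lo, lo | m)
                yield from utf8_sequences((lo | m) + 1, hi)
                return
            if (hi & m) != m:
                yield from utf8_sequences(lo, (hi & ~m) - 1)
                yield from utf8_sequences(hi & ~m, hi)
                return
    # Now each byte position is an independent range.
    yield list(zip(encode(lo), encode(hi)))

def antlr_alternative(seq):
    """Format one sequence in the '\\uXXXX'..'\\uXXXX' style used above."""
    parts = []
    for first, last in seq:
        if first == last:
            parts.append("'\\u%04X'" % first)
        else:
            parts.append("'\\u%04X'..'\\u%04X'" % (first, last))
    return " ".join(parts)

for seq in utf8_sequences(0x037F, 0x1FFF):
    print("| " + antlr_alternative(seq))
# | '\u00CD' '\u00BF'
# | '\u00CE'..'\u00DF' '\u0080'..'\u00BF'
# | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF'
# | '\u00E1' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
```

The same function applied to 0x10000..0xEFFFF yields three four-byte alternatives, which is how the range that cannot be written as a 16-bit character literal would be expressed in this scheme.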
Does anyone see any pitfalls to this other than increasing the
lookahead for the lexer? Since in our source language all the meaningful
punctuation is in the visible US-ASCII range, the only place where the
difference between parsing Unicode characters and UTF-8 encoded Unicode
characters would matter is in things like the NAME token production.
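The reason ASCII punctuation is unaffected is a property of UTF-8 itself: every byte of a multi-byte sequence has its high bit set, so no ASCII byte can appear inside the encoding of a non-ASCII character. A small Python check, for illustration only (the sample characters and the helper name `has_ascii_byte` are arbitrary choices for this sketch):

```python
# Sample code points taken from the NAME_START_CHAR ranges, plus U+10000.
samples = "\u00C0\u037F\u2070\uFFFD\U00010000"

def has_ascii_byte(ch):
    """True if any byte of ch's UTF-8 encoding is in the ASCII range."""
    return any(b < 0x80 for b in ch.encode("utf-8"))

# No non-ASCII character contains a byte that could be read as punctuation.
ascii_collisions = [ch for ch in samples if has_ascii_byte(ch)]
print(ascii_collisions)  # []

# Encoded length of each sample, in bytes. The longest UTF-8 sequence is
# four bytes, which bounds how far a rule has to look for one character.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [2, 2, 3, 3, 4]
```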
This seems much preferable to me over extending the C++ support
with some Unicode library (like IBM's ICU or some such).
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/