[antlr-interest] Unicode handling
Sebastian Kaliszewski
sk at z.pl
Fri Apr 23 02:13:18 PDT 2004
Mark Lentczner wrote:
>>Well the C++ stuff truncates to 8 bits whenever it sees fit.
>
> My plan is to not do anything in 16 bits. Just lex the UTF-8 as a pure
> 8-bit stream. So, so long as the Antlr generator and the generated C++
> code and the support lib are all 8-bit character clean, I'm home free.
>
Yup. Same here!
> Actually, the few productions that require interpretation of Unicode
> characters (only allow alphabetics, for example) are complex enough
> that I have a perl program that takes a Unicode set description,
> generates the Antlr rules for matching the set when represented as
> UTF-8 sequences.
Would you publish that under some ANTLR-like licence (or any licence allowing
unconstrained use of the resulting grammar)? If the stuff is not inside ANTLR
itself, it'd be nice to have some preprocessing tool handy.
[snip]
> If (and it is a big if), Antlr wanted to support the idea of "specify a
> parser with a Unicode source character set, but the generated parser
> reads and parses the UTF-8 encoded stream representation" I believe
> that I can offer the code that would make this automatic.
That's exactly what I need. I don't need to parse UTF-16 or anything like
that. UTF-8 is what I want.
> For example:
> options { charVocabulary: Unicode-via-UTF-8; }
> ...
> ALPHA_OMEGA: "\u0391\u03A9" | "\u03B1\u03C9" ;
> DASHES: '\u2010'..'\u2015' ;
>
> Would internally become:
> options { charVocabulary: '\u0000'..'\u00FF'; }
> ...
> ALPHA_OMEGA: "\316\221\316\251" | "\316\261\317\211" ;
> DASHES: '\342' '\200' '\220'..'\225' ;
>
> The only hitch is that the user would have to probably up the value of
> k manually (I don't think I could or want to compute the "correct" new
> value.) I have an algorithm working that works for these and more
> complicated cases as well. (It handles the XML 1.0 and XML 1.1 name
> productions, which are pretty hairy!)
This is the problem: it makes it hackish, so I suspect the ANTLR maintainers
won't put it in. But having such an external preprocessing tool would be nice.
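To make the rewriting in the quoted example concrete, here is a minimal Python
sketch of what such a preprocessing step might do. It is not Mark's perl
program; the function names utf8_octal and utf8_range are made up for
illustration, and utf8_range only covers the easy case where both ends of the
range share every UTF-8 byte except the last. Ranges that cross a lead-byte
boundary (as the XML name productions do) need a real splitting algorithm.

```python
def utf8_octal(text):
    """Return the UTF-8 bytes of text as octal escapes, e.g. \\316\\221."""
    return "".join("\\%03o" % b for b in text.encode("utf-8"))


def utf8_range(lo, hi):
    """Rewrite the range lo..hi as a byte-level rule fragment.

    Hypothetical helper: handles only ranges whose endpoints encode to the
    same number of bytes and differ in the final byte alone.
    """
    a, b = lo.encode("utf-8"), hi.encode("utf-8")
    if len(a) != len(b) or a[:-1] != b[:-1] or a[-1] > b[-1]:
        raise ValueError("range needs splitting; not handled in this sketch")
    # Shared leading bytes become single-character matches.
    prefix = " ".join("'\\%03o'" % x for x in a[:-1])
    # Only the final byte varies, so it becomes the range.
    last = "'\\%03o'..'\\%03o'" % (a[-1], b[-1])
    return (prefix + " " + last) if prefix else last


print('ALPHA_OMEGA: "%s" | "%s" ;' % (utf8_octal("\u0391\u03A9"),
                                      utf8_octal("\u03B1\u03C9")))
print("DASHES: %s ;" % utf8_range("\u2010", "\u2015"))
```

Run on the two rules from the example, this should reproduce the byte-level
forms quoted above. It says nothing about what k has to become.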
[snip]
rgds
--
Sebastian Kaliszewski