[antlr-interest] Unicode handling

Sebastian Kaliszewski sk at z.pl
Fri Apr 23 02:13:18 PDT 2004


Mark Lentczner wrote:
>>Well the C++ stuff truncates to 8 bits whenever it sees fit.
> 
> My plan is to not do anything in 16 bits.  Just lex the UTF-8 as a pure 
> 8-bit stream.  So, so long as the Antlr generator and the generated C++ 
> code and the support lib are all 8-bit character clean, I'm home free.
> 

Yup. Same here!

> Actually, the few productions that require interpretation of Unicode 
> characters (only allow alphabetics, for example) are complex enough 
> that I have a perl program that takes a Unicode set description, 
> generates the Antlr rules for matching the set when represented as 
> UTF-8 sequences.

Would you publish that under some ANTLR-like licence (or any licence allowing 
unconstrained use of the resulting grammar)? If this doesn't end up inside 
ANTLR itself, it'd be nice to have a preprocessing tool handy.


[snip]
> If (and it is a big if), Antlr wanted to support the idea of "specify a 
> parser with a Unicode source  character set, but the generated parser 
> reads and parses the UTF-8 encoded stream representation"  I believe 
> that I can offer the code that would make this automatic.

That's exactly what I need. I don't need to parse UTF-16 or anything like 
that. UTF-8 is what I want.

> For example:
> 	options { charVocabulary: Unicode-via-UTF-8; }
> 	...
> 	ALPHA_OMEGA: "\u0391\u03A9" | "\u03B1\u03C9" ;
> 	DASHES: '\u2010'..'\u2015' ;
> 
> Would internally become:
> 	options { charVocabulary: '\u0000'..'\u00FF'; }
> 	...
> 	ALPHA_OMEGA: "\316\221\316\251" | "\316\261\317\211" ;
> 	DASHES: '\342' '\200' '\220'..'\225' ;
> 
> The only hitch is that the user would have to probably up the value of 
> k manually (I don't think I could or want to compute the "correct" new 
> value.)  I have an algorithm working that works for these and more 
> complicated cases as well. (It handles the XML 1.0 and XML 1.1 name 
> productions, which are pretty hairy!)

This is the problem: it makes it hackish, so I suspect the ANTLR maintainers 
won't put it in. But having such an external preprocessing tool would be nice.
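
To make the rewriting quoted above concrete, here is a minimal Python sketch 
of the two simplest cases from the example: a string literal, and a range 
whose endpoints differ only in the last UTF-8 byte. This is not Mark's perl 
program (which isn't shown in this thread); the function names are made up 
for illustration, and a real tool would also have to split ranges that cross 
lead-byte boundaries, which this sketch simply refuses to do.

```python
def octal_escape(data: bytes) -> str:
    """Render each byte as a three-digit octal escape, e.g. 0xCE -> \\316."""
    return "".join("\\%03o" % b for b in data)


def utf8_string_literal(text: str) -> str:
    """Rewrite a Unicode string as a double-quoted literal of UTF-8 bytes."""
    return '"' + octal_escape(text.encode("utf-8")) + '"'


def utf8_range_rule(lo: int, hi: int) -> str:
    """Rewrite the code point range lo..hi as a sequence of byte matches.

    Assumption: both endpoints encode to the same number of bytes and share
    every byte except the last. Anything else needs to be split into several
    alternatives first, which is not attempted here.
    """
    if lo > hi:
        raise ValueError("empty range")
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    if len(first) != len(last) or first[:-1] != last[:-1]:
        raise ValueError("range crosses a UTF-8 prefix boundary; split it first")
    # Fixed prefix bytes, each matched as a single character.
    parts = ["'" + octal_escape(first[i:i + 1]) + "'" for i in range(len(first) - 1)]
    # Final byte: a single character or a byte range.
    tail = "'" + octal_escape(first[-1:]) + "'"
    if lo != hi:
        tail += "..'" + octal_escape(last[-1:]) + "'"
    parts.append(tail)
    return " ".join(parts)


if __name__ == "__main__":
    print("ALPHA_OMEGA:", utf8_string_literal("\u0391\u03A9"),
          "|", utf8_string_literal("\u03B1\u03C9"), ";")
    print("DASHES:", utf8_range_rule(0x2010, 0x2015), ";")
```

Run as a script, it prints the two rewritten rules from the quoted example. 
The multi-byte prefixes it emits are also why k has to go up: a single 
Unicode character now takes up to four characters of lookahead.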

[snip]

rgds
-- 
Sebastian Kaliszewski


 


