[antlr-interest] Unicode handling

Thu Apr 22 13:01:20 PDT 2004

> Well the C++ stuff truncates to 8 bits whenever it sees fit.
My plan is to avoid 16 bits entirely and lex the UTF-8 as a pure 8-bit 
stream.  As long as the Antlr generator, the generated C++ code, and 
the support lib are all 8-bit clean, I'm home free.
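
To illustrate why a pure 8-bit lex is workable, here is a minimal Python 
sketch (the helper name utf8_bytes_by_kind is made up for this example).  
It relies on the UTF-8 property that every byte of a multibyte sequence 
has its high bit set, so rules written for 7-bit characters never match 
inside an encoded non-ASCII character.

```python
def utf8_bytes_by_kind(text):
    """Split the UTF-8 bytes of text into (ascii_bytes, high_bytes)."""
    data = text.encode("utf-8")
    ascii_bytes = bytes(b for b in data if b < 0x80)
    high_bytes = bytes(b for b in data if b >= 0x80)
    return ascii_bytes, high_bytes


# "\u00F7" is the division sign; it is the only non-ASCII character here.
ascii_part, high_part = utf8_bytes_by_kind("x = a \u00F7 b;")
print(ascii_part)  # prints b'x = a  b;'
print(high_part)   # prints b'\xc3\xb7'
```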

> Note that 2.7.4 will barf out attempts at 16 bit char constants.
Do you mean that 2.7.4 won't allow '\u00BF' but will allow '\277'?  Or 
will 2.7.4 only be upset if the upper byte isn't zero, e.g. '\u2004'?  
(Yes, I'm still on 2.7.3)

> If you trick antlr to make the right bitsets,
Er, is there a known issue with bitsets and 8-bit high characters?

> you may get away by handcoding/modifying the few rules that need to
> deal with UTF8 multibyte sequences.
Actually, the few productions that require interpretation of Unicode 
characters (only allow alphabetics, for example) are complex enough 
that I have a perl program that takes a Unicode set description and 
generates the Antlr rules for matching the set when represented as 
UTF-8 sequences.

> The moment you put the icky bits in nice 8 bit strings you're
> basically homefree except for sorting out the actual lengths of the
> text etc.
All the literals in our source language are 7-bit strings, so no 
concern here.  But even if we supported something like U+00F7 (the 
division sign, ÷, UTF-8 encoded as 0xC3 0xB7), then I'd just code the 
literal as "\u00C3\u00B7" or "\303\267", and let Antlr think it is a 
two-character string.
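
As a rough illustration of that hand-coding step, the following Python 
sketch turns a Unicode string into a literal made of octal-escaped UTF-8 
bytes.  The function name is hypothetical, and for simplicity it escapes 
every byte, including plain ASCII.

```python
def utf8_octal_literal(text):
    """Return text as a double-quoted literal of octal-escaped UTF-8 bytes."""
    escaped = "".join(f"\\{b:03o}" for b in text.encode("utf-8"))
    return '"' + escaped + '"'


print(utf8_octal_literal("\u00F7"))  # prints "\303\267"
```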

> You could get away with redefining the strings in antlr to wchars and
> recompiling a hacked version of the support lib to have a bit more
> 'room' to maneuver (sp?). That has been done before with some luck.
<speak voice="kid from Time Bandits"> No, don't touch it.... wchar is 
EEEEEEVIL! </speak>

> I commend you if you do it with the current support lib (in both
> cases ;) )
Any code changes will be coming your way, Ric...

> Might be preferable over reinventing the wheel though. And for me a lot
> quicker to implement stuff (unless there's volunteers out there?).
If (and it is a big if) Antlr wanted to support the idea of "specify a 
parser with a Unicode source character set, but have the generated 
parser read and parse the UTF-8 encoded stream representation", I 
believe that I can offer the code that would make this automatic.

For example:
	options { charVocabulary = Unicode-via-UTF-8; }
	...
	ALPHA_OMEGA: "\u0391\u03A9" | "\u03B1\u03C9" ;
	DASHES: '\u2010'..'\u2015' ;

Would internally become:
	options { charVocabulary = '\u0000'..'\u00FF'; }
	...
	ALPHA_OMEGA: "\316\221\316\251" | "\316\261\317\211" ;
	DASHES: '\342' '\200' '\220'..'\225' ;

The only hitch is that the user would probably have to raise the value 
of k manually, since each source character can now expand to a 
multi-byte sequence (I don't think I could, or would want to, compute 
the "correct" new value).  I have a working algorithm that handles 
these and more complicated cases as well.  (It handles the XML 1.0 and 
XML 1.1 name productions, which are pretty hairy!)
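
For readers who want to see the shape of such a translation, here is a 
minimal Python sketch of one common way to split a code point range into 
UTF-8 byte-range sequences.  It is not the perl program mentioned above; 
all function names are made up for this example, and it does not exclude 
the surrogate range U+D800..U+DFFF.

```python
def encode_utf8(cp):
    """Encode one code point as a list of UTF-8 byte values."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]


def utf8_ranges(lo, hi):
    """Split [lo, hi] into sequences of (byte_lo, byte_hi) pairs."""
    # First split where the encoded length changes.
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return utf8_ranges(lo, limit) + utf8_ranges(limit + 1, hi)
    # Same length: split until each trailing byte position is either
    # within one block or covers whole aligned 6-bit blocks.
    for i in range(1, len(encode_utf8(lo))):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                return (utf8_ranges(lo, lo | mask)
                        + utf8_ranges((lo | mask) + 1, hi))
            if hi & mask != mask:
                return (utf8_ranges(lo, (hi & ~mask) - 1)
                        + utf8_ranges(hi & ~mask, hi))
    return [list(zip(encode_utf8(lo), encode_utf8(hi)))]


def antlr_rule_body(lo, hi):
    """Render the byte-range sequences as Antlr-style alternatives."""
    def lit(b):
        return f"'\\{b:03o}'"

    def alt(seq):
        return " ".join(lit(a) if a == b else f"{lit(a)}..{lit(b)}"
                        for a, b in seq)

    return " | ".join(alt(seq) for seq in utf8_ranges(lo, hi))


print(antlr_rule_body(0x2010, 0x2015))  # prints '\342' '\200' '\220'..'\225'
```

The longest alternative produced this way is four bytes, which gives a 
feel for how much lookahead a translated rule might need.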

> And I also wonder what you'll get if you feed the lexer in java mode a
> sequence that contains such a value broken up over two UTF-16 values
> (that for lexer terms should be treated as one!).
Java prior to 1.5 is blissfully unaware.  It will think of a UTF-16 
surrogate pair as two characters.  In 1.5 it will start thinking of the 
type 'char' as a UTF-16 code unit, not a Unicode character.  Not clear 
how this will affect things, but I doubt they'll break any old APIs.

What this means is that parsers currently built with Antlr in Java mode 
really parse UTF-16 code units, not Unicode characters.  So if you want 
to match U+1D11E (Musical Symbol G Clef), you have to match the string 
"\uD834\uDD1E".
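
The surrogate pair for a supplementary character follows from the 
standard UTF-16 arithmetic.  A minimal Python sketch (the function names 
are made up for this example):

```python
def utf16_surrogates(cp):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                  # 20 bits remain after the offset
    high = 0xD800 | (v >> 10)         # top 10 bits
    low = 0xDC00 | (v & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_escape(cp):
    """Format the surrogate pair as a \\uXXXX\\uXXXX escape string."""
    return "".join(f"\\u{unit:04X}" for unit in utf16_surrogates(cp))


print(utf16_escape(0x1D11E))  # prints \uD834\uDD1E
```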

- Mark
