[antlr-interest] unicode strings using supplemental char range

Mon Jun 28 07:46:28 PDT 2004

Hello!
2 cents from a (probably) typical ANTLR/C++ user...

I'm writing an interpreter for a not-too-complicated OO language. Even at 
this early stage, with just a basic interpreter working and minimal 
run-time support, the UPX-compressed, statically linked, stripped 
executable is already above 600KB, and the uncompressed, dynamically linked 
debug executable is above 3MB (most of that is ANTLR-generated code). I'm 
afraid that including yet another heavy library will add too much weight.

Ric Klaren wrote:
> On Sat, Jun 26, 2004 at 08:48:34PM -0700, Mark Lentczner wrote:
> 
>>[ Terminology: "char" = Java type, 16-bits, "character" = Unicode
>>character in range 0..0x10FFFF, "String" = Java type, "unicode string"
>>= a sequence of zero or more characters. ]
>>
>>	- the ability to read a stream of characters
>>	- the ability to take a String written in a grammar file (i.e.
>>"abc\u20ac123") and produce a unicode string from it (i.e. [ 97, 98,
>>99, 8364, 49, 50, 51 ] - or if your mailer can handle it: [ 'a', 'b',
>>'c', '€', '1', '2', '3' ])
>>
>>Supporting these requirements hardly needs something as heavy weight as
>>ICU for either Java or C++ parsers.  (ICU has things like calendar
>>handling, regex matching, and number formatting in it!)
> 
> 
> It's a pity they don't have a 'light' version indeed. That's also why I'm
> contemplating making some lightweight classes for some aspects we need
> inside ANTLR for the C++ side.

Well, if you accept that ANTLR users must have a standard-compliant C++ 
library (the standard is now 5 years old, so I don't think that's asking too 
much, and for those using ancient compilers there is the free STLport as 
well as a few commercial offerings), then getting the basics done takes 
little more than a char_traits class for 32-bit integers plus a simple, 
non-normalizing UTF-8 decoder (in a streambuf?).
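
To give an idea of how small the decoder part is, here is a minimal sketch. 
It is not ANTLR code: decode_utf8 is a hypothetical helper name, it uses 
std::uint32_t from a current C++ standard library for the 32-bit code 
points, and it leaves out both the char_traits class and the streambuf 
wrapper. It rejects malformed input by throwing, which is only one of 
several reasonable policies.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of ANTLR): decode a UTF-8 byte string into
// 32-bit code points. No normalization or transcoding is attempted.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;       // code point being assembled
        std::size_t extra = 0;      // continuation bytes expected
        std::uint32_t lowest = 0;   // smallest value legal for this length

        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; lowest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; lowest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; lowest = 0x10000;
        } else {
            throw std::runtime_error("invalid UTF-8 lead byte");
        }

        if (bytes.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and values past U+10FFFF.
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Fed the UTF-8 bytes of Mark's example string "abc\u20ac123", this should 
give back the same seven code points he listed, and a character from the 
supplemental range simply comes out as one 32-bit value.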

If you need/want a volunteer to write such skeleton code, I'm here.

[snip]
> I'm not sure but don't we need this on the C++ side? We have no java to do
> this stuff for us. So at least I'd like to be able to feed a UTF-8 string
> to ANTLR out of the box without deriving some custom classes. Life is a lot
> easier on the lexer side if it only has to deal with unicode characters. So
> some input stream decoding will be necessary. On second thought this
> probably does not need transcoding/normalization just the decoding for
> UTF-xx.
> 
> Question: How to deal with the unicode characters that are beyond 32 bits
> they'd need a more expensive struct to have all bits or do we have to make
> the lexers operate on UTF-32.

???
Well, the Unicode FAQ says they are not even going beyond 21 bits, so 
32 bits seems safe for the foreseeable future.
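
The arithmetic can be checked at compile time. A small sketch, assuming a 
C++11 compiler; kMaxCodePoint and bits_needed are names I made up for the 
illustration:

```cpp
#include <cstdint>

// Last code point in the Unicode range 0..0x10FFFF.
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// Number of bits needed to represent v (0 for v == 0).
constexpr int bits_needed(std::uint32_t v)
{
    return v == 0 ? 0 : 1 + bits_needed(v >> 1);
}

static_assert(bits_needed(kMaxCodePoint) == 21,
              "every Unicode code point fits in 21 bits");
static_assert(bits_needed(kMaxCodePoint) <= 32,
              "so a 32-bit integer holds any code point");
```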

> 
[snip]
> 

rgds
-- 
Sebastian Kaliszewski
inForma
