[antlr-interest] unicode strings using supplemental char range

Ric Klaren klaren at cs.utwente.nl
Mon Jun 28 08:12:56 PDT 2004


On Mon, Jun 28, 2004 at 04:46:28PM +0200, Sebastian Kaliszewski wrote:
> I'm writing an interpreter for a not too complicated OO language. But even at
> this early stage, when I have just a basic interpreter working with
> minimalistic run-time support, the UPX compressed statically linked stripped
> executable is already above 600KB and the uncompressed dynamically linked debug
> executable is above 3MB (and most of that is ANTLR generated stuff). I'm
> afraid that including yet another heavy library will add too much weight.

I would not want to make ICU a default thing. It's just plain too much
baggage, at least I don't want to see it end up in my projects where the
input is plain ASCII. But for some cases it might be desirable to be able
to plug some ICU classes into a lexer/parser chain.

Note: You can probably cut off 100k from the support lib by removing the
'hidden' versions of some classes (on x86). For the generated code you can
probably tinker a bit with the bitset gen options and see what produces
smaller code, if it really becomes an issue.

> Well, if you accept that users of ANTLR must have a standard-compliant C++
> library (as the standard is now 5 years old I don't think that's requiring too
> much, and for those using ancient stuff there is the free STLport as well as a
> few commercial offerings to use) then getting the basic stuff done is not much
> more than making a char_traits class for 32-bit integers + some simple,
> non-normalizing UTF-8 decoder (in a streambuf?).

I hope so :)
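
For illustration only, a rough sketch of what such a simple, non-normalizing
UTF-8 decoder might look like. The names decodeUtf8, CodePoint and
kReplacement are made up for this example and are not part of the ANTLR
support library. Mapping malformed input to U+FFFD is an assumption as well,
a real lexer input stream might prefer to throw instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names, not ANTLR API.
typedef std::uint32_t CodePoint;
const CodePoint kReplacement = 0xFFFD;  // assumed policy for bad input

// Decode a UTF-8 byte string into 32-bit code points.
// No normalization or transcoding is done, only decoding.
std::vector<CodePoint> decodeUtf8(const std::string& in)
{
    std::vector<CodePoint> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        CodePoint cp = 0;
        CodePoint min = 0;      // smallest value allowed for this length
        std::size_t extra = 0;  // number of continuation bytes expected

        if (b0 < 0x80)                { cp = b0;        extra = 0; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min = 0x10000; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
            ++i;
        }

        // Reject truncated sequences, overlong encodings, surrogates and
        // anything above the last Unicode code point.
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = kReplacement;
        out.push_back(cp);
    }
    return out;
}
```

A lexer would then see one 32-bit value per character, e.g. the three bytes
E2 82 AC come out as the single code point 0x20AC.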

> If you need/want a volunteer to do such skeleton stuff, I'm here.

Adding you to the list :)

> > I'm not sure but don't we need this on the C++ side? We have no java to do
> > this stuff for us. So at least I'd like to be able to feed a UTF-8 string
> > to ANTLR out of the box without deriving some custom classes. Life is a lot
> > easier on the lexer side if it only has to deal with unicode characters. So
> > some input stream decoding will be necessary. On second thought this
> > probably does not need transcoding/normalization just the decoding for
> > UTF-xx.
> >
> > Question: How to deal with the unicode characters that are beyond 32 bits?
> > They'd need a more expensive struct to have all bits, or do we have to make
> > the lexers operate on UTF-32?
>
> ???
> Well, it's in the Unicode FAQ that they're not even going beyond 21 bits, so
> 32 bits seems safe for the foreseeable future.

/Adds coffee /hits himself on the head ;) I was babbling nonsense there.
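
To make the 21 bits point concrete, a tiny sketch (kMaxCodePoint and
isScalarValue are made-up names for this example, not ANTLR API): the last
Unicode code point is U+10FFFF, which is below 2^21, so any 32-bit integer
type holds every code point with room to spare.

```cpp
#include <cstdint>

// Hypothetical names, not ANTLR API.
const std::uint32_t kMaxCodePoint = 0x10FFFF;  // last Unicode code point

// Compile-time check that every code point fits in 21 bits.
static_assert(kMaxCodePoint < (1u << 21), "code points fit in 21 bits");

// True for values a lexer could legitimately see as a decoded character:
// within the Unicode range and not a UTF-16 surrogate.
inline bool isScalarValue(std::uint32_t cp)
{
    return cp <= kMaxCodePoint && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```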

Cheers,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
-----+++++*****************************************************+++++++++-------
  Before they invented drawing boards, what did they go back to?



 