[antlr-interest] C++ code target

Jim Idle jimi at temporal-wave.com
Tue Apr 3 11:49:09 PDT 2007


What the C implementation does is deal with everything internally a
UTF-32, you can then supply an input stream that provides each character
as a 32 bit value, regardless of the input encoding (which the input
stream is responsible for dealing with). Because all the library code
then deal with 32 bit characters regardless of the input stream, there
is no need for anything to know about the size of th incoming characters
except the input stream itself, which may need to know how to rest to a
specific character offset etc. The advantage is that there is little if
any overhead. The token stream holds offsets that the input stream knows
how to convert to 'strings' if they are referenced. There is currently
support for latin-1 and UTF-16 (UCS2 I suppose) input streams and string
manipulations for both (which will probably be easier to handle in C++ I
suspect ;-).

If you really need a C++ interface and cannot wait for Ric's
implementation, then you could use the C output and create a wrapper
class for it? I was thinking of adding this to the output for C anyway
in fact so that you could include the header and it would be a class
definition if asked for.

Ter - perhaps we can consider that ability for a target to define
multiple output files (call lots of templates like headerfile() with the
same input as headerfile/outputfile ?). This would make it a bit neater
to generate a COM interface for instance - however it can all be done in
the same header file in the end of course, with # define.

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Don Caton
Sent: Saturday, March 31, 2007 3:38 PM
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] C++ code target

> fashion. Currently I intend to drop unicode support for now and first
> get a 8 bit version out.

Ric:

Please don't do that.  One of the biggest limitations in Antlr 2 is the
lack
of proper Unicode support.  

Why should the code have any dependence on the size of a character?
Please
don't make the same mistake in 3.0.  The lexer class should be a
template
class that takes the size of a character as a template parameter.  Then
there will be no need to go back and make another version for Unicode.
It
should not make any difference whether you are parsing 8 bit characters
or
16 bit characters or characters of any arbitrary length.





More information about the antlr-interest mailing list