[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)

Sun Jul 6 09:47:31 PDT 2008

On Sun, 2008-07-06 at 16:49 +0200, Joe wrote:

> Johannes Luber wrote:
> > Joe schrieb:
> >> Gavin Lambert wrote:

> > Johannes
> That solves the problem of recognizing them as pairs that belong 
> together. Too bad it can't really replace the Unicode categories. Do 
> you think it would be possible to integrate a handwritten Lexer using 
> ICU with ANTLR generated parser and tree parser? I couldn't find much 
> info on the C interface 

For ICU or for ANTLR? The API documentation for the C interface is at:

http://www.antlr.org/api/C/index.html

However, it is much simpler than you think in ANTLR 3.1, unless you
don't want to pre-load the input. ANTLR 3.1 has an encoding conversion
library (from the Unicode standard). If your input is not UCS2 and you
must deal with surrogate pairs, then by far the easiest solution is to
make a copy of the file antlr3ucs2inputstream.c and call it
antlr3utf32inputstream.c, renaming the constructors accordingly. Then
change the few references to {p}ANTLR3_UINT16 to {p}ANTLR3_UINT32. 
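
To illustrate why the fixed width version is so little work, here is a
minimal sketch of lookahead over a buffer of 32 bit code points. It is not
code from the runtime: the struct, the function names and the EOF marker are
all hypothetical, and it only handles positive lookahead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical EOF marker for this sketch only. */
#define SKETCH_EOF 0xFFFFFFFFu

/* Hypothetical stand-in for an input stream over pre-loaded UTF-32 data. */
typedef struct {
    const uint32_t *data;  /* code points, one per element */
    size_t          size;  /* number of code points in data */
    size_t          pos;   /* index of the next unconsumed code point */
} Utf32StreamSketch;

/* Look ahead k code points (k is 1-based, so k == 1 is the next one).
 * With fixed width elements this is plain indexing; no surrogate logic. */
uint32_t sketch_la(const Utf32StreamSketch *s, size_t k)
{
    assert(s != NULL);
    if (k == 0 || s->size - s->pos < k) {
        return SKETCH_EOF;
    }
    return s->data[s->pos + k - 1];
}

/* Advance past one code point, stopping at the end of the buffer. */
void sketch_consume(Utf32StreamSketch *s)
{
    assert(s != NULL);
    if (s->pos < s->size) {
        s->pos++;
    }
}
```

A UCS2 version of the same thing would differ only in the element type,
which is why copying the file and widening the type is enough.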

As per the docs, the C runtime works internally with 32 bit characters,
hence the lexer is divorced from the input stream and doesn't care how
you produce it. If you don't convert the input to a fixed width
encoding, then your LA() and related functions have to cater for the
surrogate pair combinations. You can do that, but it is a pain and it
will be slower. In the source file antlr3convertutf.c you will see a
number of functions targeted at specific conversions. So, if your input
is UTF-8 and contains code points that would require surrogate pairs in
a 16 bit encoding, you can use ConvertUTF8toUTF32() and then open the
result with your UTF32 input stream. Similarly, there is
ConvertUTF16toUTF32().
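
For anyone wondering what "catering for surrogate pairs" amounts to, the
following is a rough, self-contained sketch of turning UTF-16 code units into
32 bit code points. It is a hand-written stand-in for illustration, not the
runtime's ConvertUTF16toUTF32(); the function name and its error convention
are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the ANTLR C runtime.
 * Decodes src_len UTF-16 code units into dst (room for dst_cap code points).
 * Returns the number of code points written, or (size_t)-1 if the input is
 * malformed or dst is too small. */
size_t utf16_to_utf32_sketch(const uint16_t *src, size_t src_len,
                             uint32_t *dst, size_t dst_cap)
{
    size_t i = 0;  /* index into src */
    size_t n = 0;  /* code points written so far */

    assert(src != NULL && dst != NULL);

    while (i < src_len) {
        uint32_t unit = src[i++];
        uint32_t cp;

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            uint32_t low;
            if (i >= src_len) {
                return (size_t)-1;
            }
            low = src[i];
            if (low < 0xDC00 || low > 0xDFFF) {
                return (size_t)-1;
            }
            i++;
            cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return (size_t)-1;
        } else {
            cp = unit;  /* BMP code point, one unit */
        }

        if (n >= dst_cap) {
            return (size_t)-1;
        }
        dst[n++] = cp;
    }
    return n;
}
```

Doing this once up front keeps the branching out of LA(), which is the
reason for converting before the lexer ever sees the input.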

I do intend to rationalize this and provide an input stream that will do
this internally, but apart from a bit of copying and editing, it is easy
enough to create your own input encodings. Please read the numerous
comments and the API docs if you want to do more than make a copy of the
UCS2 input stream and have it process UTF32 characters. That should be
all you need, as internally the runtime is designed from the ground up to
cater for UTF32, as per the documentation.

> (wish there was a working C++ interface), so I'm 
> not sure if that is feasible.

See above and read the docs - you should be able to do this easily, and
you don't need ICU to do it. I do agree that we should look at having at
least a notation for the Unicode character classes that knows the
character ranges and so on, if not some special states that know
this internally. That is easy enough for Java and C#, but a little more
painful for C, as I don't want to rely on third party libraries such as
ICU, even though that is a very good library.
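
As a rough idea of what "knowing the character ranges" could look like in
plain C without a third party library, here is a sketch of a sorted range
table with a binary search. Nothing like this exists in the runtime as far as
this message goes; the type and function names are hypothetical, and the
table is a deliberately incomplete excerpt (basic Latin letters and the
Latin-1 letter blocks) rather than the full XID_Start set, which would have
to be generated from the Unicode data files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} CodePointRange;

/* Incomplete, illustrative excerpt only; sorted and non-overlapping.
 * A real table would be generated from DerivedCoreProperties.txt. */
static const CodePointRange xid_start_excerpt[] = {
    { 0x0041, 0x005A },  /* A-Z */
    { 0x0061, 0x007A },  /* a-z */
    { 0x00C0, 0x00D6 },  /* Latin-1 letters before the multiplication sign */
    { 0x00D8, 0x00F6 },  /* Latin-1 letters before the division sign */
};

/* Binary search: returns 1 if cp falls inside any range of the table. */
int in_ranges_sketch(const CodePointRange *table, size_t count, uint32_t cp)
{
    size_t lo = 0;
    size_t hi = count;

    assert(table != NULL);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < table[mid].lo) {
            hi = mid;
        } else if (cp > table[mid].hi) {
            lo = mid + 1;
        } else {
            return 1;
        }
    }
    return 0;
}
```

Since the lexer already works on 32 bit code points, a lookup like this
needs no knowledge of the input encoding.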

Jim