[antlr-interest] Unicode XID_Start/XID_Continue? (And other Unicode properties)

Sun Jul 6 15:10:30 PDT 2008

> On Sun, 2008-07-06 at 16:49 +0200, Joe wrote:
>
>   
>> Johannes Luber wrote:
>>     
>>> Joe schrieb:
>>>       
>>>> Gavin Lambert wrote:
>>>>         
>
>
>
>   
>>> Johannes
>>>       
>> That solves the problem of recognizing them as pairs that belong 
>> together. Too bad it can't really replace the Unicode categories. Do
>> you think it would be possible to integrate a handwritten Lexer using 
>> ICU with ANTLR generated parser and tree parser? I couldn't find much 
>> info on the C interface 
>>     
>
> For ICU or for ANTLR? The API documentation for the C interface is at:
>
> http://www.antlr.org/api/C/index.html
>   

ANTLR. I did find that document but found it somewhat hard to 
understand. Partially because I'm used to C++ APIs, not C APIs, but also 
because some sections with 'detailed information' (Interacting with the 
Generated Code for example) don't appear to exist.

>
> However, it is much simpler than you think in ANTLR 3.1, unless you
> don't want to pre-load the input. ANTLR 3.1 has an encoding conversion
> library (from the Unicode standard). If your input is not UCS2 and you
> must deal with surrogate pairs, then by far the easiest solution is to
> make a copy of the file antlr3ucs2inputstream.c and call it
> antlr3utf32inputstream, renaming the constructors accordingly. Then
> change the few references to {p}ANTLR3_UINT16 to {p}ANTLR3_UINT32. 
>
> As per the docs, the C runtime works internally with 32 bit characters,
> hence the lexer is divorced from the input stream and doesn't care how
> you produce it. If you don't convert the input to a fixed width
> encoding, then your LA() and related functions have to cater for the
> surrogate pair combinations, which is a pain, though you can do it, and
> will be slower. In the source file antlr3convertutf.c you will see a
> number of functions targeted for specific conversions, so, if your input
> is utf8 and the input codepoints would require surrogate pairs even in
> 16 bit encodings, you can use ConvertUTF8toUTF32(), and then open the
> result with your UTF32 input stream. Similarly, there is
> ConvertUTF16toUTF32().
>
> I do intend to rationalize this and provide an input stream that will do
> this internally, but other than a bit of copying and editing, it is easy
> enough to create your own input encodings. Please read the numerous
> comments and the API docs if you want to do more than make a copy of the
> UCS2 input stream and have it process UTF32 characters. That should be
> all you need as internally it is designed from the ground up to cater
> for UTF32, as per the documentation.
>
>   
So, you say I should make an antlr3utf32inputstream, convert my input via
ConvertUTF16toUTF32 and use that. But does this actually help with the 
grammar itself? How can I describe characters in the U+10000 to U+10FFFF 
range since there are no appropriate escape sequences?
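
For what it's worth, the conversion step itself is just arithmetic on
surrogate pairs. Below is a minimal sketch of what a UTF-16 to UTF-32
conversion has to do. The function name utf16_to_utf32 and its signature
are made up for illustration; this is not the runtime's
ConvertUTF16toUTF32(), whose actual interface is in antlr3convertutf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the ANTLR runtime.
 * Decodes src_len UTF-16 code units into UTF-32 code points.
 * Returns the number of code points written to dst, or (size_t)-1
 * if the input has an unpaired surrogate or dst is too small. */
size_t utf16_to_utf32(const uint16_t *src, size_t src_len,
                      uint32_t *dst, size_t dst_cap)
{
    size_t i = 0, n = 0;

    assert(src != NULL && dst != NULL);

    while (i < src_len) {
        uint32_t cp = src[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            uint32_t lo;
            if (i >= src_len)
                return (size_t)-1;
            lo = src[i];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            i++;
            /* Combine the two 10-bit halves and add the U+10000 offset. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            return (size_t)-1;
        }

        if (n >= dst_cap)
            return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

With this, the pair 0xD834 0xDD1E comes out as the single code point
U+1D11E, so the lexer would only ever see one 32-bit character. It does
not answer how to write such a code point in the grammar, though.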
>   
>> (wish there was a working C++ interface), so I'm 
>> not sure if that is feasible.
>>     
>
>
> See above and read the docs - you should be able to do this easily, and
> you don't need ICU to do it. I do agree that we should look at having at
> least a notation for the Unicode character classes that knows
> the character ranges and so on, if not some special states that know
> this internally. Easy enough for Java and C#, but a little more pain for
> C as I don't want to rely on third party libraries such as ICU, even
> though that is a very good library.
>
> Jim

Why not? ICU comes in Java, C and C++ versions (Python bindings are
also in development). Checking for a Unicode property takes only
a single line of code with ICU, and *all* Unicode properties are
supported, even the derived ones like XID_Start. Do you really want to
duplicate all that effort? Also, for XID_Start to make any sense in the
first place, the input must be NFKC normalized. Do you also want to
start implementing normalization in ANTLR to have usable Unicode
support? What I mean to say is: using ICU would make life a lot easier
for ANTLR developers, and IMO the benefits outweigh the disadvantages of
having to include a third-party library.