[antlr-interest] 'C' code generator and Unicode

Thu Jul 12 09:21:39 PDT 2007

Jim,

Thank you so much for the quick response. It's very nice to know when joining a new list that it's active.

I haven't digested all you said yet, but I tried reading the file in binary mode and passing in the UCS2 string. It looks a lot better, but I guess I'm still doing something a little wrong. The output is this:

-memory-(1) : lexer error 3 :
        1:1: Tokens : ( CAP | LWR | WHITESPACE | BOTH | FULL | ALLUPPER | ALLLOWER | MIXED ); at offset 0, near char(0XFF) :
         ?T
FULL
ALLUPPER
ALLLOWER
MIXED

I also get an assertion failure, which I have to ignore twice to get the output. The assertion says:
File: isctype.c Line: 56
Expression: (unsigned)(C+1) <= 256
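
For context: the debug build of the Microsoft C runtime raises this assertion when one of the character-classification functions (isprint(), isalpha() and so on) is handed a value outside the range -1..255, which a 16-bit code unit such as 0xFEFF is. Where the offending call sits is not established in this thread. A minimal sketch of a guard, where is_printable_unit is a hypothetical helper name and not part of the ANTLR runtime:

```c
#include <ctype.h>

/* Hypothetical helper: classify a wide code unit without tripping the
 * CRT range check. Values above 0xFF are never passed to isprint(). */
int is_printable_unit(unsigned long unit)
{
    if (unit > 0xFF) {
        return 0;               /* outside the range isprint() accepts */
    }
    return isprint((int)unit) != 0;
}
```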

My guess at the lexer failure is that it doesn't like the BOM. Do you handle the BOM, or do you expect a particular byte order (LE or BE)?
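
The "near char(0XFF)" at offset 0 would be consistent with a little-endian UTF-16 byte order mark (bytes FF FE) reaching the lexer as data. One way to check is to inspect the first two bytes before building the input stream. This is only a sketch; detect_utf16_bom and the bom_kind enum are hypothetical names, not part of the ANTLR runtime:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the check below. */
typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Look for a UTF-16 byte order mark in the first two bytes of a raw buffer.
 * Note that a UTF-32LE BOM (FF FE 00 00) also begins with FF FE; this
 * helper does not try to tell the two apart. */
bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) return BOM_UTF16_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF) return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```

If a BOM is found, the caller can skip those two bytes and use the reported byte order when assembling 16-bit code units.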

Bob,

UCS2 (basically UTF16 without surrogate support) works fine, but instead of using the default input stream you need to name the 16-bit UCS2 version:

antlr3NewUCS2StringInPlaceStream() - In memory string

I have just noticed that I have not provided the equivalent function for files, so for the moment you will need to read the file yourself and pass in the pointer to the data.
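
Reading the file yourself could look roughly like the sketch below. It assumes the file was opened with "rb" and that you already know its byte order; read_all and bytes_to_units are hypothetical helper names, not runtime functions. The resulting array is what you would hand to antlr3NewUCS2StringInPlaceStream(); check the runtime headers for that function's exact parameters.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything from a stream opened in binary mode. Caller frees. */
unsigned char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0, n;
    unsigned char *buf = malloc(cap);
    assert(fp != NULL && out_len != NULL);
    if (buf == NULL) return NULL;
    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {                       /* buffer full: double it */
            unsigned char *grown = realloc(buf, cap * 2);
            if (grown == NULL) { free(buf); return NULL; }
            buf = grown;
            cap *= 2;
        }
    }
    *out_len = len;
    return buf;
}

/* Assemble 16-bit code units from raw bytes in the stated byte order.
 * A trailing odd byte is ignored. Caller frees the result. */
uint16_t *bytes_to_units(const unsigned char *bytes, size_t len,
                         int big_endian, size_t *out_count)
{
    size_t count = len / 2, i;
    uint16_t *units = malloc((count ? count : 1) * sizeof *units);
    assert(bytes != NULL && out_count != NULL);
    if (units == NULL) return NULL;
    for (i = 0; i < count; i++) {
        unsigned hi = bytes[2 * i + (big_endian ? 0 : 1)];
        unsigned lo = bytes[2 * i + (big_endian ? 1 : 0)];
        units[i] = (uint16_t)((hi << 8) | lo);
    }
    *out_count = count;
    return units;
}
```

Building the code units byte by byte, rather than casting the byte buffer to a 16-bit pointer, keeps the result independent of the host machine's own byte order.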

I have not provided UTF32 input streams, but this is just a matter of copying the code for UCS2 and changing the casts from {p}ANTLR3_UINT16 to {p}ANTLR3_UINT32.
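
To illustrate how small that change is, here is a sketch of a character fetch for each width. It is not the runtime's source: uint16_t and uint32_t stand in for ANTLR3_UINT16 and ANTLR3_UINT32, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fetch the character at 'index' from a UCS2 buffer. */
uint32_t char_at_ucs2(const void *data, size_t index)
{
    assert(data != NULL);
    return ((const uint16_t *)data)[index];
}

/* The UTF32 variant: identical apart from the pointer cast. */
uint32_t char_at_utf32(const void *data, size_t index)
{
    assert(data != NULL);
    return ((const uint32_t *)data)[index];
}
```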

Input streams for handling UTF-8 and UTF-16 with surrogates are a ways down the line from my typing fingers, as there is a lot else to do. There is not a great deal of work involved, though, because the mark() and rewind() calls are set up to keep the absolute pointer rather than the character position for such streams.
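
The reason the absolute pointer matters: in a variable-width encoding, a character index cannot be turned back into a byte position without rescanning, whereas a saved pointer can simply be restored. A minimal UTF-8 sketch under that assumption (utf8_next is a hypothetical name, and this is not the runtime's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at p. Stores the code point in *cp and
 * returns the pointer just past the sequence, or NULL at end of input or on
 * a malformed sequence. Overlong forms and surrogates are not rejected. */
const unsigned char *utf8_next(const unsigned char *p,
                               const unsigned char *end, uint32_t *cp)
{
    uint32_t c;
    int extra, i;
    assert(p != NULL && end != NULL && cp != NULL);
    if (p >= end) return NULL;
    c = *p++;
    if (c < 0x80)                { extra = 0; }
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
    else return NULL;            /* stray continuation or invalid lead byte */
    for (i = 0; i < extra; i++) {
        if (p >= end || (*p & 0xC0) != 0x80) return NULL;
        c = (c << 6) | (uint32_t)(*p++ & 0x3F);
    }
    *cp = c;
    return p;
}
```

A "mark" here is just a copy of the current pointer, and a "rewind" is assigning it back, however many bytes each character occupied in between.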

Finally, note that the input stream is the only thing that differs, so a lexer/parser will work the same with any input stream at all, because internally it uses UTF32. So, change the stream you create and otherwise use it exactly the same as your ASCII stream.

Finally, finally, note that the ANTLR3_STRING 'classes' you get from the tokens and so on will retain the encoding of the input stream, but otherwise you again use exactly the same methods on them (subStr, append, etc). There are methods for appending 8-bit character strings to any type of string, which are mainly useful when you are programming and want to add a string like "DEBUG: XXXXXX" to a 16-bit output string.
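
Conceptually, appending an 8-bit string to a 16-bit one just widens each byte to a 16-bit unit. The sketch below shows that idea on a plain array; it is not the ANTLR3_STRING API, and append8_to_16 is a hypothetical name. Widening bytes directly like this is only correct for ASCII or Latin-1 text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append a NUL-terminated 8-bit string to a buffer of 16-bit units.
 * 'len' is the current length and 'cap' the capacity, both in units.
 * Input that does not fit is dropped. Returns the new length. */
size_t append8_to_16(uint16_t *dst, size_t len, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && len <= cap);
    while (*src != '\0' && len < cap) {
        /* Go through unsigned char so bytes above 0x7F do not sign-extend. */
        dst[len++] = (uint16_t)(unsigned char)*src++;
    }
    return len;
}
```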

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Bob Cowdery
> Sent: Thursday, July 12, 2007 7:28 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] 'C' code generator and Unicode
> 
> Hi all,
> 
> This is a first post to the list. I am using Antlr 3.0 with the C
> runtime. I have managed to compile and run a simple grammar. My
> question however is around Unicode support. I have tried every lexer I
> can find but the only one that does what I expect so far is jFlex, but
> java is not an option. For the test I have a number of files saved in
> ASCII, UTF8, UTF16 and UTF32 which I am feeding through the lexer. The
> grammar is very simple.
> 
> grammar SimpleC;
> 
> options {	language = C;}
> 
> CAP		:	'\u0041'..'\u005a' ;
> LWR		:	'\u0061'..'\u007a' ;
> WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };
> 
> BOTH		:	CAP | LWR ;
> FULL		:	(CAP)(LWR)+ ;
> ALLUPPER	:	CAP+ ;
> ALLLOWER	:	LWR+ ;
> MIXED		:	BOTH+ ;
> 
> 
> atom	:	 FULL		{ printf( "FULL\n"); };
> atom1	:	 ALLUPPER	{ printf( "ALLUPPER\n"); };
> atom2	:	 ALLLOWER	{ printf( "ALLLOWER\n"); };
> atom3	:	 MIXED	{ printf( "MIXED\n"); };
> 
> If I feed the ASCII file (or UTF8 with single-byte characters) through,
> I get the expected output.
> 
> From input: This IS some TExt
> FULL
> ALLUPPER
> ALLLOWER
> MIXED
> 
> From the UTF16 file I get:
> (there are lots of these errors, one for every leading 00 in the UTF16 text)
> data-utf16-1.txt(1) : lexer error 3 :
>         1:1: Tokens : ( CAP | LWR | WHITESPACE | BOTH | FULL | ALLUPPER | ALLLOWER | MIXED ); at offset 35, near char(00) :
> 
> FULL
> data-utf16-1.txt(1)  : error 2 : Unexpected token, at offset -1
>     near [Index: 0 (Start: 0-Stop: 2) =' ?T', type<4> Line: 1 LinePos:-1]
>      : expected FULL ...
> ALLUPPER
> ALLLOWER
> MIXED
> 
> Although strangely it still gives output mixed in with errors.
> 
> I won't clutter the post up with UTF32 as it gives the same but 3 times
> the number of errors on '00'.
> 
> It seems that the data is still being matched on bytes and not
> characters. I know I probably need to give the lexer a wide input
> stream but I can't figure how. The comments in the code suggest all
> input is treated as UTF32 and confusingly there is also an
> antlr3ucs2inputstream.c input stream file which suggests UCS2 support
> but I've no idea how to use it.
> 
> If anybody can provide some insight into how to make this work (UTF16
> is my preferred format) it would be much appreciated.
> 
> Regards
> Bob