[antlr-interest] 'C' code generator and Unicode

Thu Jul 12 12:04:09 PDT 2007

Jim

Not a problem. It works fine without the BOM marker and, as you say, it is not a hard thing to handle myself. Excellent work on the implementation, by the way, and thanks for the help.

Bob

-----Original Message-----
From: Jim Idle [mailto:jimi at temporal-wave.com]
Sent: 12 July 2007 18:25
To: Bob Cowdery; antlr-interest at antlr.org
Subject: RE: [antlr-interest] 'C' code generator and Unicode

Bob,

The input stream doesn't handle the BOM, as you guessed. I did start doing this, but then, as everything is overridable and extensible, I thought that if one needed BOM handling, it could be done easily by overriding the methods on the standard UCS2 input stream.

So, the default implementation assumes that the endianness is the same as the machine it is running on and just does (ANTLR3_UINT32)(*((pANTLR3_UINT16)(input->data))).
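
To make the effect of that cast concrete, here is a minimal sketch of a native-order read. It is not the runtime's code: read_native_u16 is a hypothetical helper, and the plain stdint types stand in for the ANTLR3_UINT16/ANTLR3_UINT32 typedefs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the default read: fetch one 16-bit code unit
 * in the machine's native byte order and widen it to 32 bits.
 * memcpy is used instead of a pointer cast so unaligned data is safe. */
static uint32_t read_native_u16(const void *p)
{
    uint16_t unit;

    assert(p != NULL);
    memcpy(&unit, p, sizeof unit);
    return (uint32_t)unit;
}
```

Read this way, the two bytes FF FE at the start of a file come back as 0xFEFF on a little-endian machine and as 0xFFFE on a big-endian one, so a BOM left in the buffer reaches the lexer as an ordinary character.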

Basically, at the time I wanted to get on with the trickier stuff of parsing. At some point soon it is time to come back and expand the input streams I guess.

Your easiest option for the moment is:

1) Test the BOM yourself;
2) Create the input stream with the pointer at the first non-BOM character;
3) If the BOM indicates different ordering to the native machine, then install your own version of the LA function that returns the UINT32 version of the character, picked up in the correct ordering.

That should be it basically.
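
The three steps might be sketched roughly as follows. This is plain C with no ANTLR calls: bom_kind, detect_bom, bom_length, needs_swap and read_u16 are all hypothetical names invented for the illustration, and wiring read_u16 into the stream's LA function is left out. The sketch also assumes the buffer really is 16-bit data, since a UTF-32 little-endian BOM begins with the same FF FE bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is ANTLR3 runtime API. */
typedef enum { BOM_NONE, BOM_LE, BOM_BE } bom_kind;

/* Step 1: test the first two bytes for a 16-bit byte order mark. */
static bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) return BOM_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF) return BOM_BE;
    }
    return BOM_NONE;
}

/* Step 2: number of bytes to skip so the stream starts after the BOM. */
static size_t bom_length(bom_kind kind)
{
    return kind == BOM_NONE ? 0 : 2;
}

/* Returns 1 when this machine stores the low byte first. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Step 3, the decision: does the BOM disagree with the native order? */
static int needs_swap(bom_kind kind)
{
    if (kind == BOM_NONE) return 0;   /* no BOM: assume native order */
    return (kind == BOM_LE) != host_is_little_endian();
}

/* Step 3, the read: assemble one code unit in the stated byte order
 * and widen it to 32 bits, independent of the host's own ordering. */
static uint32_t read_u16(const unsigned char *p, bom_kind kind)
{
    if (kind == BOM_NONE)
        kind = host_is_little_endian() ? BOM_LE : BOM_BE;
    if (kind == BOM_BE)
        return ((uint32_t)p[0] << 8) | p[1];
    return ((uint32_t)p[1] << 8) | p[0];
}
```

With a buffer read from disk, one would call detect_bom(buf, len), hand buf + bom_length(kind) to the stream constructor, and only replace the read routine when needs_swap(kind) is non-zero.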

Jim

> -----Original Message-----
> From: Bob Cowdery [mailto:Bob.Cowdery at smartlogic.com]
> Sent: Thursday, July 12, 2007 9:22 AM
> To: Jim Idle; antlr-interest at antlr.org
> Subject: RE: [antlr-interest] 'C' code generator and Unicode
> 
> Jim,
> 
> Thank you so much for the quick response. It's very nice to know when
> joining a new list that it's active.
> 
> I haven't digested all you said yet but I tried reading the file binary
> wise and passing in the UCS2 string. It looks a lot better but I guess
> I'm still doing something a little wrong. The output is this,
> 
> -memory-(1) : lexer error 3 :
>         1:1: Tokens : ( CAP | LWR | WHITESPACE | BOTH | FULL | ALLUPPER | ALLLOWER | MIXED ); at offset 0, near char(0XFF) :
>          ?T
> FULL
> ALLUPPER
> ALLLOWER
> MIXED
> 
> but I also get an assertion failure which I have to ignore twice to get
> the output. The assertion says
> File: isctype.c Line: 56
> Expression: (unsigned)(C+1) <= 256
> 
> My guess at the parser failure is that it doesn't like the BOM marker.
> Do you handle the BOM or expect the order to be LE or BE?
> 
> Bob,
> 
> UCS2 (which is UTF16 without surrogate support basically) works fine,
> but instead of using the default input stream you need to name the 16
> bit UCS2 version:
> 
> antlr3NewUCS2StringInPlaceStream() - In memory string
> 
> I have just noticed that I have not provided the equivalent function
> for files, so for the moment you will need to read it yourself and pass
> in  the pointer to the data.
> 
> I have not provided UTF32 input streams, but this is just a matter of
> copying the code for UCS2 and changing the casts from {p}ANTLR3_UINT16
> to {p}ANTLR3_UINT32.
> 
> Input streams for handling UTF-8 and UTF-16 with surrogates are a ways
> down the line from my typing fingers as there is a lot to do, but there
> is not a great deal of work here as the mark() and rewind() calls are
> set up to keep the absolute pointer rather than the character position
> for such streams.
> 
> Finally, note that the input stream is the only difference in the
> function, so a lexer/parser will work the same with any input stream at
> all because internally it uses UTF32. So, change the stream you create
> and otherwise use it exactly the same as your ASCII stream.
> 
> Finally, finally, note that the ANTLR3_STRING 'classes' you get from
> the tokens and so on will retain the encoding of the input stream, but
> otherwise you again use exactly the same methods on them (subStr,
> append, etc). There are methods for appending 8 bit character strings
> to any type of string, which are mainly useful when you are
> programming and want to add a string like "DEBUG: XXXXXX" to a 16 bit
> output string.
> 
> Jim
> 
> > -----Original Message-----
> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > bounces at antlr.org] On Behalf Of Bob Cowdery
> > Sent: Thursday, July 12, 2007 7:28 AM
> > To: antlr-interest at antlr.org
> > Subject: [antlr-interest] 'C' code generator and Unicode
> >
> > Hi all,
> >
> > This is a first post to the list. I am using Antlr 3.0 with the C
> > runtime. I have managed to compile and run a simple grammar. My
> > question however is around Unicode support. I have tried every lexer
> > I can find but the only one that does what I expect so far is jFlex,
> > but java is not an option. For the test I have a number of files saved
> > in ASCII, UTF8, UTF16 and UTF32 which I am feeding through the lexer.
> > The grammar is very simple.
> >
> > grammar SimpleC;
> >
> > options {	language = C;}
> >
> > CAP		:	'\u0041'..'\u005a' ;
> > LWR		:	'\u0061'..'\u007a' ;
> > WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };
> >
> > BOTH		:	CAP | LWR ;
> > FULL		:	(CAP)(LWR)+ ;
> > ALLUPPER	:	CAP+ ;
> > ALLLOWER	:	LWR+ ;
> > MIXED		:	BOTH+ ;
> >
> >
> > atom	:	 FULL		{ printf( "FULL\n"); };
> > atom1	:	 ALLUPPER	{ printf( "ALLUPPER\n"); };
> > atom2	:	 ALLLOWER	{ printf( "ALLLOWER\n"); };
> > atom3	:	 MIXED	{ printf( "MIXED\n"); };
> >
> > If I feed the ASCII file (or UTF8 with single character codes)
> > through I get as expected.
> >
> > From input: This IS some TExt
> > FULL
> > ALLUPPER
> > ALLLOWER
> > MIXED
> >
> > From the UTF16 file I get:
> > (there are lots of these errors for every leading 00 in the UTF16
> > text.)
> > data-utf16-1.txt(1) : lexer error 3 :
> >         1:1: Tokens : ( CAP | LWR | WHITESPACE | BOTH | FULL |
> ALLUPPER
> > |
> > ER | MIXED ); at offset 35, near char(00) :
> >
> > FULL
> > data-utf16-1.txt(1)  : error 2 : Unexpected token, at offset -1
> >     near [Index: 0 (Start: 0-Stop: 2) =' ?T', type<4> Line: 1
> LinePos:-
> > 1]
> >      : expected FULL ...
> > ALLUPPER
> > ALLLOWER
> > MIXED
> >
> > Although strangely it still gives output mixed in with errors.
> >
> > I won't clutter the post up with UTF32 as it gives the same but 3
> > times the number of errors on '00'.
> >
> > It seems that the data is still being matched on bytes and not
> > characters. I know I probably need to give the lexer a wide input
> > stream but I can't figure how. The comments in the code suggest all
> > input is treated as UTF32 and confusingly there is also a
> > antlr3ucs2inputstream.c input stream file which suggests UCS2 support
> > but I've no idea how to use it.
> >
> > If anybody can provide some insight into how to make this work (UTF16
> > is my preferred format) it would be much appreciated.
> >
> > Regards
> > Bob