[antlr-interest] 'C' code generator and Unicode

Thu Jul 12 07:28:29 PDT 2007

Hi all,

This is my first post to the list. I am using ANTLR 3.0 with the C runtime and have managed to compile and run a simple grammar. My question is about Unicode support. I have tried every lexer I can find, and the only one that does what I expect so far is JFlex, but Java is not an option. For the test I have a number of files saved as ASCII, UTF8, UTF16 and UTF32, which I am feeding through the lexer. The grammar is very simple:

grammar SimpleC;

options {	language = C;}

CAP		:	'\u0041'..'\u005a' ;
LWR		:	'\u0061'..'\u007a' ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };

BOTH		:	CAP | LWR ;
FULL		:	(CAP)(LWR)+ ;
ALLUPPER	:	CAP+ ;
ALLLOWER	:	LWR+ ;
MIXED		:	BOTH+ ;

atom	:	 FULL		{ printf( "FULL\n"); };
atom1	:	 ALLUPPER	{ printf( "ALLUPPER\n"); };
atom2	:	 ALLLOWER	{ printf( "ALLLOWER\n"); };
atom3	:	 MIXED	{ printf( "MIXED\n"); }; 

If I feed the ASCII file (or a UTF8 file containing only single-byte character codes) through, I get the expected output.

From input: This IS some TExt
FULL
ALLUPPER
ALLLOWER
MIXED

From the UTF16 file I get the following
(there are lots of these errors, one for every leading 00 in the UTF16 text):
data-utf16-1.txt(1) : lexer error 3 :
        1:1: Tokens : ( CAP | LWR | WHITESPACE | BOTH | FULL | ALLUPPER |
ALLLOWER | MIXED ); at offset 35, near char(00) :

FULL
data-utf16-1.txt(1)  : error 2 : Unexpected token, at offset -1
    near [Index: 0 (Start: 0-Stop: 2) =' ?T', type<4> Line: 1 LinePos:-1]
     : expected FULL ...
ALLUPPER
ALLLOWER
MIXED

Strangely, it still produces the expected output, mixed in with the errors.

I won't clutter the post with the UTF32 output, as it is the same but with three times as many errors on '00'.

It seems that the data is still being matched on bytes rather than characters. I know I probably need to give the lexer a wide input stream, but I can't figure out how. The comments in the code suggest that all input is treated as UTF32, and, confusingly, there is also an antlr3ucs2inputstream.c input stream file, which suggests UCS2 support, but I have no idea how to use it.
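
To illustrate what I mean by bytes versus characters, here is a minimal sketch in plain C that does not touch the ANTLR runtime at all. The names count_zero_bytes and utf16le_to_units are hypothetical helpers I made up for this post, and the sketch assumes little-endian UTF16 with no BOM and no surrogate pairs. The first function counts what a byte-wise lexer would trip over; the second shows the kind of 16-bit view I would expect the lexer to match against instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the 0x00 bytes that a byte-oriented
 * lexer would see as separate "characters". */
size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00) {
            zeros++;
        }
    }
    return zeros;
}

/* Hypothetical helper: combine byte pairs into 16-bit code units.
 * Assumes little-endian input, no BOM, no surrogate pairs.
 * Returns the number of code units written to out. */
size_t utf16le_to_units(const unsigned char *buf, size_t len,
                        uint16_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < len && n < out_cap; i += 2) {
        /* low byte first, high byte second */
        out[n++] = (uint16_t)(buf[i] | (buf[i + 1] << 8));
    }
    return n;
}
```

For the four characters of "This" stored as little-endian UTF16 (8 bytes), count_zero_bytes reports 4 zero bytes, and for the same text as little-endian UTF32 (16 bytes) it reports 12, which matches the factor of three in the error counts. utf16le_to_units turns the same 8 bytes into 4 code units, the first being 0x0054 ('T').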

If anybody can provide some insight into how to make this work (UTF16 is my preferred format), it would be much appreciated.

Regards
Bob