[antlr-interest] [ANTLR C 3.1.3] Error when parsing international characters

Andy Grove andy.grove at codefutures.com
Thu Jun 18 14:26:13 PDT 2009


Jim,

Thanks. I am a step further along now. I used the iconv library to  
convert the UTF-8 input into UCS-2 and then passed that into Antlr's  
UCS2 input stream. I also added the ANY lexer rule as you suggested.  
The generated parser now runs without error but does not seem to match  
anything in the grammar, or at least none of the actions that call  
back into my C code are being called. Do you have any suggestions as  
to what is going wrong?

FYI, here's the code I'm using for setting up the input stream for  
this test.

	iconv_t cd = iconv_open ("UTF-8", "UCS-2");

         char *inbuf = new char[ 8192 ];
         sprintf(inbuf, "%s", sql.c_str());
         size_t insize = strlen(inbuf);

         size_t outsize = 8192;
         char *outbuf = new char[ outsize ];

         size_t s = iconv (cd, &inbuf, &insize, &outbuf, &outsize);

         input =  
antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)outbuf, s, NULL);


Thanks,

Andy.
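
Note on the setup code above: three things in it may explain why nothing matches. iconv_open() takes the target encoding first and the source second, so ("UTF-8", "UCS-2") converts in the opposite direction from the one described. iconv() advances the inbuf and outbuf pointers as it works, so outbuf no longer points at the start of the converted text after the call. Its return value is the number of irreversible conversions, not a length. Below is a minimal sketch of the conversion with those points addressed. The helper name utf8ToUcs2 is hypothetical (it is not part of ANTLR or iconv), and the code assumes a little-endian host so that "UCS-2LE" output can be read directly as 16-bit units.

```cpp
#include <iconv.h>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of ANTLR): convert UTF-8 text to 16-bit
// code units. Assumes a little-endian host, hence "UCS-2LE".
std::vector<uint16_t> utf8ToUcs2(const std::string& utf8) {
    // iconv_open takes the TARGET encoding first, then the SOURCE encoding.
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // Each UTF-8 byte produces at most one UCS-2 unit, so this is enough room.
    std::vector<uint16_t> out(utf8.size());

    // iconv advances these pointers and decrements the counts; the vector
    // itself remembers where the output starts.
    char* inPtr = const_cast<char*>(utf8.data());
    size_t inLeft = utf8.size();
    char* outPtr = reinterpret_cast<char*>(out.data());
    size_t outLeft = out.size() * sizeof(uint16_t);

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) throw std::runtime_error("iconv failed");

    // The return value counts irreversible conversions, not output length.
    // Derive the length from how much of the output buffer was used.
    size_t bytesWritten = out.size() * sizeof(uint16_t) - outLeft;
    out.resize(bytesWritten / sizeof(uint16_t));
    return out;
}
```

The result could then be handed to the stream as antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)units.data(), units.size(), NULL), which passes the length in 16-bit characters as Jim describes below. Since the stream is created "in place", the vector presumably has to stay alive for as long as the stream is in use.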


On Jun 16, 2009, at 5:35 PM, Jim Idle wrote:

> Andy,
>
> I think your likely issues are:
>
> 1) as mentioned earlier the length you are passing in is in bytes  
> and the stream needs the number of 16 bit chars;
> 2) the encoding isn't what you think it is and the 16 bit characters  
> are whacked enough to blow your lexer. Make sure you have a catch  
> all ANY token listed last in your lexer:
>
> ANY : . { printf("some message"); } ;
>
> 3) the memory you are passing is not converted to 16 bit correctly  
> using the calls you have here.
>
> Something else.
>
> Sorry I can't get much further but trying to do everything by iPhone  
> is a bit tricky.
>
> Jim
>
> On Jun 16, 2009, at 11:18 AM, Andy Grove  
> <andy.grove at codefutures.com> wrote:
>> Jim,
>>
>> Thanks. I've attempted to use the UCS input stream with this code:
>>
>> SymbolTable* SQLParser::parse(std::string sql) {
>>
>> 	....
>>
>> 	std::wstring wsql(sql.begin(), sql.end());
>> 	const wchar_t *wsqlchars = wsql.c_str();
>> 	input =  
>> antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)wsqlchars,  
>> wsql.length(), NULL);
>>
>> 	...
>>
>> }
>>
>> Am I even close with this? It compiles OK but now when I run my  
>> test the app becomes unresponsive and consumes all the available RAM.
>>
>> Thanks,
>>
>> Andy.
>>
>>
>> On Jun 16, 2009, at 9:21 AM, Jim Idle wrote:
>>
>>> You need the UCS version of the input stream or write a utf32  
>>> input stream and use the pre-supplied UTF8 to UTF32 conversion  
>>> routine.
>>>
>>> If you can wait until next release I will be supplying these ready  
>>> made but they are not difficult to produce, just copy the others.  
>>> Internally the runtime uses 32 bit unicode and does not care how  
>>> you provide these.
>>>
>>> Jim
>>>
>>> On Jun 16, 2009, at 9:20 AM, Andy Grove  
>>> <andy.grove at codefutures.com> wrote:
>>>
>>>> I have a SQL parser that is working fine with standard ASCII  
>>>> characters but if I try and insert data containing international  
>>>> characters such as:
>>>>
>>>> "INSERT INTO customer (username, password, title, first_name,  
>>>> last_name, addr_line1, addr_line2, addr_city, addr_state,  
>>>> country_id) VALUES (''username123', 'password', 'Mr', 'Tåst',  
>>>> 'Test', 'Test', 'Test', 'Test', 'TE', 1)"
>>>>
>>>> I get this error:
>>>>
>>>> -memory-(1) : lexer error 1 :
>>>> 	Unexpected character at offset 179, near char(0XC3) :
>>>> 	åst', 'Test', 'Test
>>>>
>>>> Here is my setup code:
>>>>
>>>> 	input =  
>>>> antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8)stringCopy, l,  
>>>> NULL);
>>>> 	lexer = DbsMySQL_CPPLexerNew(input);
>>>> 	tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,  
>>>> lexer->pLexer->rec->state->tokSource);
>>>> 	parser = DbsMySQL_CPPParserNew(tstream);
>>>>
>>>> Do I need to specify the character set somewhere?
>>>>
>>>> Thanks,
>>>>
>>>> Andy.
>>>>
>>>> ---
>>>> Andy Grove
>>>> Chief Architect
>>>> CodeFutures Corporation
>>>> "Share Nothing. Shard Everything."
>>>>
>>>> Cell:    (303) 720-1285
>>>> E-Fax:   (303) 395-0426
>>>> Web:     http://www.codefutures.com/
>>>> Twitter: http://twitter.com/andygrove73
>>>>
>>>>
>>>>
>>>>
>>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
