[antlr-interest] [ANTLR C 3.1.3] Error when parsing international characters

Andy Grove andy.grove at codefutures.com
Thu Jun 18 14:59:53 PDT 2009


I realized that I was inadvertently passing zero as the length into  
the input stream. I fixed this and I am passing the correct number of  
characters now. However, the lexer/parser is invoking my ANY catchall  
code for each character in the UCS-2 input stream. Do I need to make  
my grammar UCS-2 format instead of ASCII to match the characters?
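
One way to check what the lexer is actually being fed is to dump the first few 16-bit units of the converted buffer before creating the stream. The helper below is only a sketch; the name hexDump16 is made up for illustration and is not part of the ANTLR runtime. For SQL that starts with "INSERT", a correctly converted buffer on a little-endian host should begin 0049 004E 0053; byte-swapped values such as 4900 4E00, or values that look like two UTF-8 bytes packed into one unit, would suggest the conversion is not producing what the grammar expects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical diagnostic helper: formats `count` 16-bit code units
// as space-separated, four-digit, upper-case hex values.
std::string hexDump16(const uint16_t* units, size_t count) {
    assert(units != nullptr || count == 0);  // a null buffer is only valid when empty
    std::string out;
    char buf[8];
    for (size_t i = 0; i < count; ++i) {
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(units[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Printing the result of hexDump16 on the first dozen units of the buffer, just before the stream is created, is usually enough to tell a byte-order or encoding mix-up from a grammar problem.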

The code I am using now:

        iconv_t cd = iconv_open ("UTF-8", "UCS-2");

        char *inbuf = new char[ 8192 ];
        sprintf(inbuf, "%s", sql.c_str());
        size_t insize = strlen(inbuf);
        size_t outsize = 8192;
        char *outbuf = new char[ outsize ];
        iconv (cd, &inbuf, &insize, &outbuf, &outsize);
        input = antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)outbuf, 8192-outsize, NULL);

The UTF-8 input string is 55 characters and the UCS-2 string after  
conversion is 84 bytes.
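
For comparison, here is a minimal sketch of the UTF-8 to UCS-2 step, offered as one possible approach rather than a definitive fix. It relies on three points: iconv_open takes the target encoding first and the source encoding second, so converting from UTF-8 needs the two names in the opposite order from the snippet above; iconv advances the buffer pointers it is handed, so the start of the output has to be kept separately; and the stream wants a count of 16-bit units rather than bytes, as Jim noted. The helper name utf8ToUcs2 and the choice of "UCS-2LE" (which assumes a little-endian host) are illustrative assumptions, not anything taken from the ANTLR runtime.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts UTF-8 text into 16-bit code units.
// Assumes a little-endian host, hence the explicit "UCS-2LE".
std::vector<uint16_t> utf8ToUcs2(const std::string& utf8) {
    // Target encoding comes first, source encoding second.
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // One UTF-8 byte never produces more than one 16-bit unit here.
    std::vector<uint16_t> out(utf8.size() + 1);

    // iconv moves these pointers forward, so `out` keeps the real start.
    std::string src(utf8);  // mutable copy of the input bytes
    char* inPtr = &src[0];
    size_t inLeft = src.size();
    char* outPtr = reinterpret_cast<char*>(out.data());
    size_t outBytes = out.size() * sizeof(uint16_t);
    size_t outLeft = outBytes;

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) throw std::runtime_error("iconv conversion failed");

    // Turn the number of bytes written into a number of 16-bit units.
    size_t bytesWritten = outBytes - outLeft;
    assert(bytesWritten % sizeof(uint16_t) == 0);
    out.resize(bytesWritten / sizeof(uint16_t));
    return out;
}
```

The vector's data() and size() would then be the pointer and length handed to antlr3NewUCS2StringInPlaceStream. Since that is an in-place stream, the vector presumably has to stay alive for as long as the lexer reads from it.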

Thanks,

Andy.

On Jun 18, 2009, at 3:26 PM, Andy Grove wrote:

> Jim,
>
> Thanks. I am a step further along now. I used the iconv library to  
> convert the UTF-8 input into UCS-2 and then passed that into Antlr's  
> UCS2 input stream. I also added the ANY lexer rule as you suggested.  
> The generated parser now runs without error but does not seem to  
> match anything in the grammar, or at least none of the actions that  
> call back into my C code are being called. Do you have any  
> suggestions as to what is going wrong?
>
> FYI, here's the code I'm using for setting up the input stream for  
> this test.
>
> 	iconv_t cd = iconv_open ("UTF-8", "UCS-2");
>
>         char *inbuf = new char[ 8192 ];
>         sprintf(inbuf, "%s", sql.c_str());
>         size_t insize = strlen(inbuf);
>
>         size_t outsize = 8192;
>         char *outbuf = new char[ outsize ];
>
>         size_t s = iconv (cd, &inbuf, &insize, &outbuf, &outsize);
>
>         input = antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)outbuf, s, NULL);
>
>
> Thanks,
>
> Andy.
>
>
> On Jun 16, 2009, at 5:35 PM, Jim Idle wrote:
>
>> Andy,
>>
>> I think your likely issues are:
>>
>> 1) as mentioned earlier, the length you are passing in is in bytes  
>> and the stream needs the number of 16-bit characters;
>> 2) the encoding isn't what you think it is and the 16-bit  
>> characters are whacked enough to blow your lexer. Make sure you  
>> have a catch-all ANY token listed last in your lexer:
>>
>> ANY : . { printf("some message"); } ;
>>
>> 3) the memory you are passing is not converted to 16-bit correctly  
>> using the calls you have here.
>>
>> 4) Something else.
>>
>> Sorry I can't get much further but trying to do everything by  
>> iPhone is a bit tricky.
>>
>> Jim
>>
>> On Jun 16, 2009, at 11:18 AM, Andy Grove  
>> <andy.grove at codefutures.com> wrote:
>>> Jim,
>>>
>>> Thanks. I've attempted to use the UCS input stream with this code:
>>>
>>> SymbolTable* SQLParser::parse(std::string sql) {
>>>
>>> 	....
>>>
>>> 	std::wstring wsql(sql.begin(), sql.end());
>>> 	const wchar_t *wsqlchars = wsql.c_str();
>>> 	input = antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)wsqlchars, wsql.length(), NULL);
>>>
>>> 	...
>>>
>>> }
>>>
>>> Am I even close with this? It compiles OK but now when I run my  
>>> test the app becomes unresponsive and consumes all the available  
>>> RAM.
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>>
>>> On Jun 16, 2009, at 9:21 AM, Jim Idle wrote:
>>>
>>>> You need the UCS version of the input stream, or write a UTF-32  
>>>> input stream and use the pre-supplied UTF-8 to UTF-32 conversion  
>>>> routine.
>>>>
>>>> If you can wait until the next release I will be supplying these  
>>>> ready made, but they are not difficult to produce; just copy the  
>>>> others. Internally the runtime uses 32-bit Unicode and does not  
>>>> care how you provide these.
>>>>
>>>> Jim
>>>>
>>>> On Jun 16, 2009, at 9:20 AM, Andy Grove  
>>>> <andy.grove at codefutures.com> wrote:
>>>>
>>>>> I have a SQL parser that is working fine with standard ASCII  
>>>>> characters but if I try and insert data containing international  
>>>>> characters such as:
>>>>>
>>>>> "INSERT INTO customer (username, password, title, first_name,  
>>>>> last_name, addr_line1, addr_line2, addr_city, addr_state,  
>>>>> country_id) VALUES (''username123', 'password', 'Mr', 'Tåst',  
>>>>> 'Test', 'Test', 'Test', 'Test', 'TE', 1)"
>>>>>
>>>>> I get this error:
>>>>>
>>>>> -memory-(1) : lexer error 1 :
>>>>> 	Unexpected character at offset 179, near char(0XC3) :
>>>>> 	åst', 'Test', 'Test
>>>>>
>>>>> Here is my setup code:
>>>>>
>>>>> 	input = antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8)stringCopy, l, NULL);
>>>>> 	lexer = DbsMySQL_CPPLexerNew(input);
>>>>> 	tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, lexer->pLexer->rec->state->tokSource);
>>>>> 	parser = DbsMySQL_CPPParserNew(tstream);
>>>>>
>>>>> Do I need to specify the character set somewhere?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andy.
>>>>>
>>>>> ---
>>>>> Andy Grove
>>>>> Chief Architect
>>>>> CodeFutures Corporation
>>>>> "Share Nothing. Shard Everything."
>>>>>
>>>>> Cell:    (303) 720-1285
>>>>> E-Fax:   (303) 395-0426
>>>>> Web:     http://www.codefutures.com/
>>>>> Twitter: http://twitter.com/andygrove73
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>
