[antlr-interest] [ANTLR C 3.1.3] Error when parsing international characters
Andy Grove
andy.grove at codefutures.com
Thu Jun 18 14:59:53 PDT 2009
I realized that I was inadvertently passing zero as the length into
the input stream. I fixed this and I am passing the correct number of
characters now. However, the lexer/parser is invoking my ANY catchall
code for each character in the UCS-2 input stream. Do I need to make
my grammar UCS-2 format instead of ASCII to match the characters?
The code I am using now:
iconv_t cd = iconv_open ("UTF-8", "UCS-2");
char *inbuf = new char[ 8192 ];
sprintf(inbuf, "%s", sql.c_str());
size_t insize = strlen(inbuf);
size_t outsize = 8192;
char *outbuf = new char[ outsize ];
iconv (cd, &inbuf, &insize, &outbuf, &outsize);
input = antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)outbuf, 8192-outsize, NULL);
The UTF-8 input string is 55 characters and the UCS2 string after
conversion is 84 bytes.
Thanks,
Andy.
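
A few things in the snippet above may be worth checking against the iconv documentation. `iconv_open` takes the target encoding first and the source encoding second, so `iconv_open("UTF-8", "UCS-2")` requests a UCS-2 to UTF-8 conversion rather than the reverse. `iconv` also advances the `inbuf` and `outbuf` pointers as it converts, so after the call `outbuf` points just past the converted data rather than at its start. Finally, per Jim's earlier point, the stream length is expected in 16-bit characters, not bytes. The sketch below sidesteps iconv to stay self-contained: `utf8ToUcs2` is a hypothetical helper, not part of ANTLR or iconv, that decodes UTF-8 into 16-bit code units by hand, so the count to hand to the stream is simply the vector size. It is a minimal illustration that assumes well-formed input limited to the Basic Multilingual Plane, not a replacement for a real converter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of ANTLR or iconv): decode UTF-8 into
// 16-bit code units. Only 1- to 3-byte sequences (code points up to
// U+FFFF) are accepted, since UCS-2 has no surrogate pairs.
// Sketch only: overlong encodings are not rejected.
std::vector<std::uint16_t> utf8ToUcs2(const std::string& utf8) {
    std::vector<std::uint16_t> out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80) {                  // 0xxxxxxx: plain ASCII
            cp = b0;
            len = 1;
        } else if ((b0 & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
            cp = b0 & 0x1F;
            len = 2;
        } else if ((b0 & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
            cp = b0 & 0x0F;
            len = 3;
        } else {
            throw std::runtime_error("invalid or unsupported lead byte");
        }
        if (i + len > utf8.size()) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        // Fold in the continuation bytes (10xxxxxx), six bits each.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::runtime_error("invalid continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(static_cast<std::uint16_t>(cp));
        i += len;
    }
    return out;
}
```

Assuming the in-place stream reads the buffer directly, the vector would need to outlive the stream, and the call would look something like `antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)units.data(), units.size(), NULL)`, where `units` is the returned vector. The length passed is then a count of 16-bit characters rather than bytes.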
On Jun 18, 2009, at 3:26 PM, Andy Grove wrote:
> Jim,
>
> Thanks. I am a step further along now. I used the iconv library to
> convert the UTF-8 input into UCS-2 and then passed that into Antlr's
> UCS2 input stream. I also added the ANY lexer rule as you suggested.
> The generated parser now runs without error but does not seem to
> match anything in the grammar, or at least none of the actions that
> call back into my C code are being called. Do you have any
> suggestions as to what is going wrong?
>
> FYI, here's the code I'm using for setting up the input stream for
> this test.
>
> iconv_t cd = iconv_open ("UTF-8", "UCS-2");
>
> char *inbuf = new char[ 8192 ];
> sprintf(inbuf, "%s", sql.c_str());
> size_t insize = strlen(inbuf);
>
> size_t outsize = 8192;
> char *outbuf = new char[ outsize ];
>
> size_t s = iconv (cd, &inbuf, &insize, &outbuf, &outsize);
>
> input =
> antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)outbuf, s, NULL);
>
>
> Thanks,
>
> Andy.
>
>
> On Jun 16, 2009, at 5:35 PM, Jim Idle wrote:
>
>> Andy,
>>
>> I think your likely issues are:
>>
>> 1) as mentioned earlier, the length you are passing in is in bytes
>> and the stream needs the number of 16-bit chars;
>> 2) the encoding isn't what you think it is and the 16-bit
>> characters are whacked enough to blow your lexer. Make sure you
>> have a catch-all ANY token listed last in your lexer:
>>
>> ANY : . { printf("some message"); } ;
>>
>> 3) the memory you are passing is not converted to 16 bit correctly
>> using the calls you have here.
>>
>> 4) something else.
>>
>> Sorry I can't get much further but trying to do everything by
>> iPhone is a bit tricky.
>>
>> Jim
>>
>> On Jun 16, 2009, at 11:18 AM, Andy Grove
>> <andy.grove at codefutures.com> wrote:
>>> Jim,
>>>
>>> Thanks. I've attempted to use the UCS input stream with this code:
>>>
>>> SymbolTable* SQLParser::parse(std::string sql) {
>>>
>>> ....
>>>
>>> std::wstring wsql(sql.begin(), sql.end());
>>> const wchar_t *wsqlchars = wsql.c_str();
>>> input =
>>> antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)wsqlchars,
>>> wsql.length(), NULL);
>>>
>>> ...
>>>
>>> }
>>>
>>> Am I even close with this? It compiles OK but now when I run my
>>> test the app becomes unresponsive and consumes all the available
>>> RAM.
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>>
>>> On Jun 16, 2009, at 9:21 AM, Jim Idle wrote:
>>>
>>>> You need the UCS version of the input stream, or write a UTF-32
>>>> input stream and use the pre-supplied UTF-8 to UTF-32 conversion
>>>> routine.
>>>>
>>>> If you can wait until the next release I will be supplying these
>>>> ready made, but they are not difficult to produce; just copy the
>>>> others. Internally the runtime uses 32-bit Unicode and does not
>>>> care how you provide these.
>>>>
>>>> Jim
>>>>
>>>> On Jun 16, 2009, at 9:20 AM, Andy Grove
>>>> <andy.grove at codefutures.com> wrote:
>>>>
>>>>> I have a SQL parser that is working fine with standard ASCII
>>>>> characters but if I try and insert data containing international
>>>>> characters such as:
>>>>>
>>>>> "INSERT INTO customer (username, password, title, first_name,
>>>>> last_name, addr_line1, addr_line2, addr_city, addr_state,
>>>>> country_id) VALUES (''username123', 'password', 'Mr', 'Tåst',
>>>>> 'Test', 'Test', 'Test', 'Test', 'TE', 1)"
>>>>>
>>>>> I get this error:
>>>>>
>>>>> -memory-(1) : lexer error 1 :
>>>>> Unexpected character at offset 179, near char(0XC3) :
>>>>> åst', 'Test', 'Test
>>>>>
>>>>> Here is my setup code:
>>>>>
>>>>> input =
>>>>> antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8)stringCopy, l,
>>>>> NULL);
>>>>> lexer = DbsMySQL_CPPLexerNew(input);
>>>>> tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
>>>>> lexer->pLexer->rec->state->tokSource);
>>>>> parser = DbsMySQL_CPPParserNew(tstream);
>>>>>
>>>>> Do I need to specify the character set somewhere?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andy.
>>>>>
>>>>> ---
>>>>> Andy Grove
>>>>> Chief Architect
>>>>> CodeFutures Corporation
>>>>> "Share Nothing. Shard Everything."
>>>>>
>>>>> Cell: (303) 720-1285
>>>>> E-Fax: (303) 395-0426
>>>>> Web: http://www.codefutures.com/
>>>>> Twitter: http://twitter.com/andygrove73
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>