[antlr-interest] [ANTLR C 3.1.3] Error when parsing international characters
Jim Idle
jimi at temporal-wave.com
Tue Jun 16 16:35:51 PDT 2009
Andy,
I think the likely issues are:
1) As mentioned earlier, the length you are passing in is in bytes, but
the stream needs the number of 16-bit characters;
2) The encoding isn't what you think it is, and the 16-bit characters
are mangled enough to blow up your lexer. Make sure you have a catch-all
ANY token listed last in your lexer:
ANY : . { printf("some message"); } ;
3) The memory you are passing is not converted to 16 bit correctly
by the calls you have here;
4) Something else.
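
To illustrate points 1 and 3, here is a minimal sketch of decoding the
UTF-8 bytes into 16-bit code units before handing them to the stream.
The helper name utf8ToUtf16 is hypothetical and not part of the ANTLR
runtime. It does not reject overlong or otherwise non-canonical
sequences, and whether the UCS2 stream copes with the surrogate pairs
it emits for characters above U+FFFF is an assumption that would need
checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not an ANTLR API): decode UTF-8 bytes into UTF-16
// code units. Throws std::runtime_error on truncated or malformed input.
std::vector<std::uint16_t> utf8ToUtf16(const std::string& in) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Outside the BMP: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}

// Possible use with the call shape from this thread (untested assumption).
// The stream is "in place", so buf must outlive it:
//
//   std::vector<std::uint16_t> buf = utf8ToUtf16(sql);
//   input = antlr3NewUCS2StringInPlaceStream(
//       (pANTLR3_UINT16)buf.data(), (ANTLR3_UINT32)buf.size(), NULL);
```

The length passed is buf.size(), the count of 16-bit units, not the byte
length of the original string. Building a std::wstring from the bytes
only widens each byte to a wchar_t, and wchar_t is 32 bits on many
platforms, so casting that buffer to a 16-bit pointer does not give the
stream valid 16-bit characters.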
Sorry I can't get much further, but trying to do everything by iPhone
is a bit tricky.
Jim
On Jun 16, 2009, at 11:18 AM, Andy Grove <andy.grove at codefutures.com>
wrote:
> Jim,
>
> Thanks. I've attempted to use the UCS input stream with this code:
>
> SymbolTable* SQLParser::parse(std::string sql) {
>
> ....
>
> std::wstring wsql(sql.begin(), sql.end());
> const wchar_t *wsqlchars = wsql.c_str();
> input = antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)wsqlchars,
> wsql.length(), NULL);
>
> ...
>
> }
>
> Am I even close with this? It compiles OK but now when I run my test
> the app becomes unresponsive and consumes all the available RAM.
>
> Thanks,
>
> Andy.
>
>
> On Jun 16, 2009, at 9:21 AM, Jim Idle wrote:
>
>> You need the UCS version of the input stream, or write a UTF32 input
>> stream and use the pre-supplied UTF8 to UTF32 conversion routine.
>>
>> If you can wait until the next release I will be supplying these ready
>> made, but they are not difficult to produce; just copy the others.
>> Internally the runtime uses 32 bit Unicode and does not care how
>> you provide these.
>>
>> Jim
>>
>> On Jun 16, 2009, at 9:20 AM, Andy Grove
>> <andy.grove at codefutures.com> wrote:
>>
>>> I have a SQL parser that is working fine with standard ASCII
>>> characters but if I try and insert data containing international
>>> characters such as:
>>>
>>> "INSERT INTO customer (username, password, title, first_name,
>>> last_name, addr_line1, addr_line2, addr_city, addr_state,
>>> country_id) VALUES (''username123', 'password', 'Mr', 'Tåst', 'T
>>> est', 'Test', 'Test', 'Test', 'TE', 1)"
>>>
>>> I get this error:
>>>
>>> -memory-(1) : lexer error 1 :
>>> Unexpected character at offset 179, near char(0XC3) :
>>> åst', 'Test', 'Test
>>>
>>> Here is my setup code:
>>>
>>> input =
>>> antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8)stringCopy, l,
>>> NULL);
>>> lexer = DbsMySQL_CPPLexerNew(input);
>>> tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
>>> lexer->pLexer->rec->state->tokSource);
>>> parser = DbsMySQL_CPPParserNew(tstream);
>>>
>>> Do I need to specify the character set somewhere?
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> ---
>>> Andy Grove
>>> Chief Architect
>>> CodeFutures Corporation
>>> "Share Nothing. Shard Everything."
>>>
>>> Cell: (303) 720-1285
>>> E-Fax: (303) 395-0426
>>> Web: http://www.codefutures.com/
>>> Twitter: http://twitter.com/andygrove73
>>>
>>>
>>>
>>>
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address