[antlr-interest] [ANTLR C 3.1.3] Error when parsing international characters

Jim Idle jimi at temporal-wave.com
Tue Jun 16 16:35:51 PDT 2009


Andy,

I think your likely issues are:

1) as mentioned earlier, the length you are passing in is in bytes, but
the stream needs the number of 16-bit characters;
2) the encoding isn't what you think it is, and the 16-bit characters
are mangled badly enough to blow up your lexer. Make sure you have a
catch-all ANY token listed last in your lexer:

ANY : . { printf("some message"); } ;

3) the memory you are passing is not converted to 16 bit correctly by
the calls you have here;
4) something else.
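
For points 1 and 3, here is a rough, untested-against-the-runtime sketch
of what I mean. It assumes your std::string holds UTF-8 (the 0xC3 in your
error message suggests it does). Constructing a std::wstring from
sql.begin(), sql.end() only widens each byte; it does not decode
anything, and on many Unix-like platforms wchar_t is 32 bits, so casting
it to a 16-bit pointer hands the stream garbage. utf8ToUtf16 is a
hypothetical helper of my own, not part of the ANTLR runtime, and it does
not reject overlong encodings:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an ANTLR API): decode UTF-8 into 16-bit code
// units. Malformed bytes are replaced with U+FFFD, one per bad byte.
std::vector<std::uint16_t> utf8ToUtf16(const std::string& utf8) {
    std::vector<std::uint16_t> out;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0xFFFD;   // default: replacement character
        std::size_t len = 1;         // bytes consumed by this sequence
        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

        // Fold in the continuation bytes (10xxxxxx), if they are all there.
        bool ok = (i + len <= n);
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok) { cp = 0xFFFD; len = 1; }
        // Surrogates and out-of-range values are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Outside the BMP: emit a surrogate pair (two 16-bit units).
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    // UTF-16 never needs more code units than UTF-8 needs bytes.
    assert(out.size() <= utf8.size());
    return out;
}

// Assumed usage, mirroring the call in your code:
//   std::vector<std::uint16_t> buf = utf8ToUtf16(sql);
//   input = antlr3NewUCS2StringInPlaceStream(
//               (pANTLR3_UINT16)buf.data(), (ANTLR3_UINT32)buf.size(), NULL);
// The size is buf.size(), i.e. 16-bit units, not bytes. As far as I
// recall, "in place" means the buffer is not copied, so buf has to
// outlive the stream.
```

With this, 'Tåst' comes out as four 16-bit characters with 0x00E5 in the
second position. Anything outside the BMP becomes a surrogate pair, which
a UCS2 stream will most likely treat as two separate characters.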

Sorry I can't get much further but trying to do everything by iPhone  
is a bit tricky.

Jim

On Jun 16, 2009, at 11:18 AM, Andy Grove <andy.grove at codefutures.com>  
wrote:
> Jim,
>
> Thanks. I've attempted to use the UCS input stream with this code:
>
> SymbolTable* SQLParser::parse(std::string sql) {
>
> 	....
>
> 	std::wstring wsql(sql.begin(), sql.end());
> 	const wchar_t *wsqlchars = wsql.c_str();
> 	input = antlr3NewUCS2StringInPlaceStream((pANTLR3_UINT16)wsqlchars,  
> wsql.length(), NULL);
>
> 	...
>
> }
>
> Am I even close with this? It compiles OK but now when I run my test  
> the app becomes unresponsive and consumes all the available RAM.
>
> Thanks,
>
> Andy.
>
>
> On Jun 16, 2009, at 9:21 AM, Jim Idle wrote:
>
>> You need the UCS version of the input stream, or write a UTF32 input
>> stream and use the pre-supplied UTF8 to UTF32 conversion routine.
>>
>> If you can wait until the next release I will be supplying these
>> ready-made, but they are not difficult to produce; just copy the
>> others. Internally the runtime uses 32 bit Unicode and does not care
>> how you provide these.
>>
>> Jim
>>
>> On Jun 16, 2009, at 9:20 AM, Andy Grove  
>> <andy.grove at codefutures.com> wrote:
>>
>>> I have a SQL parser that is working fine with standard ASCII  
>>> characters but if I try and insert data containing international  
>>> characters such as:
>>>
>>> "INSERT INTO customer (username, password, title, first_name,  
>>> last_name, addr_line1, addr_line2, addr_city, addr_state,  
>>> country_id) VALUES (''username123', 'password', 'Mr', 'Tåst',
>>> 'Test', 'Test', 'Test', 'Test', 'TE', 1)"
>>>
>>> I get this error:
>>>
>>> -memory-(1) : lexer error 1 :
>>> 	Unexpected character at offset 179, near char(0XC3) :
>>> 	åst', 'Test', 'Test
>>>
>>> Here is my setup code:
>>>
>>> 	input =  
>>> antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8)stringCopy, l,  
>>> NULL);
>>> 	lexer = DbsMySQL_CPPLexerNew(input);
>>> 	tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,  
>>> lexer->pLexer->rec->state->tokSource);
>>> 	parser = DbsMySQL_CPPParserNew(tstream);
>>>
>>> Do I need to specify the character set somewhere?
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> ---
>>> Andy Grove
>>> Chief Architect
>>> CodeFutures Corporation
>>> "Share Nothing. Shard Everything."
>>>
>>> Cell:    (303) 720-1285
>>> E-Fax:   (303) 395-0426
>>> Web:     http://www.codefutures.com/
>>> Twitter: http://twitter.com/andygrove73
>>>
>>>
>>>
>>>
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address