[antlr-interest] We Need UTF-8 String Literal Support for C Parser/Lexer

banmate6 at aol.com banmate6 at aol.com
Tue Jul 17 14:44:18 PDT 2012



 
Hello Folks

I have a basic expression grammar that produces abstract syntax trees (ASTs) built from the following node types:

column (this represents a database column)

functions
    Boolean: "and", "or", "not"
    Equality/Relational: "=", "!=", ">", etc.
    Arithmetic: "+", "-", etc.

literals
    int, float
    string


In our case, the expression has the following form. It is read from a tag in an XML document through the TinyXML C API.



    col1 = "UTF-8 string"


The AST looks like this, as expected:


relational node, "=" function
    child node 1, column
    child node 2, literal string


Unfortunately, the literal string in child node 2 comes out as a 4-byte string, whereas the original UTF-8 string is 6 bytes. We are not sure whether TinyXML or ANTLR is mishandling the UTF-8 literal.
We will do more testing to find out.
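
For that testing, a small byte dump at each stage may help show where the two bytes go missing. The sketch below is plain C and does not use any TinyXML or ANTLR calls. The names utf8_code_points and dump_bytes are hypothetical helpers made up for this illustration, and it assumes the strings are NUL-terminated char buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 code points in a NUL-terminated
 * string by counting every byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: print a label, the byte length, the code point
 * count, and each byte in hex, so two stages can be compared by eye. */
void dump_bytes(const char *label, const char *s)
{
    size_t i, len;

    assert(label != NULL && s != NULL);
    len = strlen(s);
    printf("%s: %lu bytes, %lu code points:", label,
           (unsigned long)len, (unsigned long)utf8_code_points(s));
    for (i = 0; i < len; ++i)
        printf(" %02X", (unsigned char)s[i]);
    printf("\n");
}
```

The idea would be to call dump_bytes once on the text as it comes back from the XML parser and once on the text of the literal token taken from the AST. If the first dump already shows 4 bytes, the loss happens before ANTLR sees the input. If the first shows 6 bytes and the second shows 4, the lexer input stream or the token text conversion is the place to look.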


However, does anybody have suggestions in advance that might explain this? Does the C code generated by ANTLR in fact handle UTF-8 string literals in this context? Is there something I must do to enable UTF-8 handling?

For your information, the version of ANTLR we are using came from "libantlr3c-3.2.tar". I am not sure whether this version handles UTF-8. Again, any advice or insight is appreciated.



Cheers, Mate


