[antlr-interest] We Need UTF-8 String Literal Support for C Parser/Lexer
banmate6 at aol.com
Tue Jul 17 14:44:18 PDT 2012
Hello Folks
I have a basic expression grammar that produces abstract syntax trees (ASTs) made up of:
  column (this represents a database column)
  functions
    Boolean: "and", "or", "not"
    Equality/Relational: "=", "!=", ">", etc.
    Arithmetic: "+", "-", etc.
  literals
    int, float
    string
In our case, we have an expression of the following form, taken from a tag in an XML document that we read using the TinyXML C API:
col1 = "UTF-8 string"
The AST looks like this, as might be expected:
  relational node, "=" function
    child node 1, column
    child node 2, literal string
Unfortunately, the literal string in child node 2 comes out as a 4-byte string, whereas the original UTF-8 string is 6 bytes. We are not sure whether it is TinyXML or ANTLR that is mishandling the UTF-8 literal.
We will do more testing to find out.
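For that testing, one option is to dump the raw bytes of the string at each stage (as returned by TinyXML, and again as the token text of the literal node) and compare them. Below is a minimal sketch of such a helper in plain C. The names `utf8_code_points` and `utf8_dump` are made up for illustration and are not part of TinyXML or the ANTLR C runtime, and the code assumes the input is a NUL-terminated, well-formed UTF-8 string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count UTF-8 code points in a NUL-terminated string by counting every
 * byte that is not a continuation byte (bit pattern 10xxxxxx).
 * Assumes the input is well-formed UTF-8; it does not validate. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

/* Print the byte length, code point count, and hex bytes of s, so the
 * same text can be compared at each stage of the pipeline.
 * (Hypothetical helper, not a TinyXML or ANTLR function.) */
void utf8_dump(FILE *out, const char *label, const char *s)
{
    const unsigned char *p;

    assert(out != NULL && label != NULL && s != NULL);
    fprintf(out, "%s: %lu bytes, %lu code points:", label,
            (unsigned long)strlen(s), (unsigned long)utf8_code_points(s));
    for (p = (const unsigned char *)s; *p != '\0'; ++p) {
        fprintf(out, " %02X", (unsigned int)*p);
    }
    fputc('\n', out);
}
```

Calling something like `utf8_dump(stderr, "from TinyXML", text)` on the expression text before it is handed to the lexer, and again on the text of child node 2, should show at which stage the two bytes go missing.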
However, does anybody have suggestions in advance that might explain this? Does the C code generated by ANTLR in fact handle UTF-8 string literals in this context? Is there something I must do in order to handle UTF-8?
For your information, the version of ANTLR we are using came from "libantlr3c-3.2.tar". I am not sure if this version handles UTF-8. Again, any advice or insight is appreciated.
Cheers, Mate