[antlr-interest] We Need UTF-8 String Literal Support for C Parser/Lexer

Tue Jul 17 14:44:18 PDT 2012

Hello folks,

I have a basic expression grammar that produces abstract syntax trees (ASTs) built from the following node types:

column (this represents a database column)

functions
    Boolean: "and", "or", "not"
    Equality/Relational: "=", "!=", ">", etc.
    Arithmetic: "+", "-", etc.

literals
    int, float
    string

In our case, we have an expression of the following form, read from a tag in an XML document via the TinyXML C API:

    col1 = "UTF-8 string"

The AST looks like this, as expected:

relational node, "=" function
    child node 1, column
    child node 2, literal string

Unfortunately, the literal string in child node 2 comes out as a 4-byte string, whereas the original UTF-8 string is 6 bytes long. We are not sure whether TinyXML or ANTLR is mishandling the UTF-8 literal. We will do more testing to find out.
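
For the testing mentioned above, one possible approach is to dump the raw bytes of the string at each stage and compare. The following is only a minimal sketch in plain C: the helper names utf8_code_points and describe_literal are made up for illustration and are not part of TinyXML or ANTLR, and the sketch assumes the string is NUL-terminated.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 code points by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

/* Hypothetical helper: write the byte length, code point count, and
   a hex dump of s into out. The output is truncated if out is too small. */
static void describe_literal(const char *s, char *out, size_t out_size)
{
    const char *sep = "";
    size_t used;
    int n;

    assert(s != NULL && out != NULL && out_size > 0);

    n = snprintf(out, out_size, "bytes=%zu code_points=%zu hex=",
                 strlen(s), utf8_code_points(s));
    if (n < 0 || (size_t)n >= out_size) {
        return; /* header alone did not fit */
    }
    used = (size_t)n;

    /* Each byte needs at most 3 characters plus the terminating NUL. */
    for (; *s != '\0' && used + 4 <= out_size; ++s) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 sep, (unsigned char)*s);
        sep = " ";
    }
}
```

Calling describe_literal once on the string as TinyXML returns it and once on the text stored in the literal node should show where the two bytes go missing. If the first report already says 4 bytes, the loss happens before ANTLR ever sees the input.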

In the meantime, does anybody have suggestions that might explain this? Does the C code generated by ANTLR in fact handle UTF-8 string literals in this context? Is there something I must do to enable UTF-8 handling?

For your information, the ANTLR C runtime we are using came from "libantlr3c-3.2.tar". I am not sure whether this version handles UTF-8. Again, any advice or insight is appreciated.

Cheers, Mate