[antlr-interest] Encoding Unicode code points in a grammar file

Fri Apr 20 07:46:28 PDT 2012

Hi,

I'm trying to use the C target to write a parser for a simple language that permits Unicode characters in identifiers and string literals.

The '\uXXXX' escape sequence works fine for characters in the Basic Multilingual Plane, but its four hex digits can't express code points above U+FFFF. Ideally, I need a way of writing the actual code point value directly in the grammar file. This quote from the C FAQ suggests it can be done:

"The purpose of LA() is to return the 32 bit integer Unicode code point for the specified character – how it does that is irrelevant to the lexer, which is just matching 32 bit numbers. This means you should not code your lexer to match surrogates, just the code points."

I haven't been able to find any documentation or code examples to support this, however. Is this actually possible, or am I barking up the wrong tree?
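
To make the limitation concrete, here is a minimal C sketch of why a single four-digit escape can't name a supplementary character. It is not ANTLR code: to_utf16 is a hypothetical helper of my own, and it only shows the standard UTF-16 arithmetic that turns one code point above U+FFFF into two surrogate units.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the ANTLR C runtime.
 * Encodes one Unicode code point as UTF-16 and returns the number of
 * 16-bit units written to out (1 for the BMP, 2 for anything above it).
 * It does not reject the surrogate range U+D800..U+DFFF itself. */
static int to_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);            /* highest valid code point */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* fits in one unit, so '\uXXXX' works */
        return 1;
    }

    cp -= 0x10000;                     /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF) this produces the pair 0xD834, 0xDD1E, so in UTF-16 input the character occupies two 16-bit units, whereas the FAQ says the lexer should only ever see the single value 0x1D11E.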

Thanks in advance,

Ross Freemantle