[antlr-interest] Encoding Unicode code points in a grammar file

Fri Apr 20 11:16:15 PDT 2012

I see that the current code assumes the \unnnn format prescribed by the
Java spec, which only covers four hex digits. We are going to have to
implement something extra, such as \xnnnnn; where the ; marks the end of
the sequence. Then characters in the supplementary planes can be coded
for. This has to be done in CTarget.java in the Java toolset. It is pretty
easy; I will see if I can do this and a few other things at the weekend
(you can always do it yourself too, of course).

Jim
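
As a rough illustration of the kind of escape discussed above, here is a
minimal Java sketch that decodes a \xnnnnn; sequence into a single code
point. The class and method names are hypothetical, the escape syntax is
only the one proposed in this message, and nothing here is taken from
CTarget.java or any other ANTLR source. Rejecting surrogate values is an
assumption made for the sketch, in line with the FAQ advice to match code
points rather than surrogates.

```java
// Hypothetical sketch: the \xN...; syntax is only a proposal from this thread,
// and none of these names come from the ANTLR code base.
public final class CodePointEscapeSketch {

    /**
     * Decodes one escape of the form \xH...; (1 to 6 hex digits, terminated
     * by ';') that begins at index 'start' of 'text'. Returns the code point.
     */
    public static int decode(String text, int start) {
        if (!text.startsWith("\\x", start)) {
            throw new IllegalArgumentException("expected \\x at index " + start);
        }
        int end = text.indexOf(';', start + 2);
        if (end < 0) {
            throw new IllegalArgumentException("missing ';' terminator");
        }
        int digits = end - (start + 2);
        if (digits < 1 || digits > 6) {
            throw new IllegalArgumentException("expected 1 to 6 hex digits");
        }
        int codePoint = 0;
        for (int i = start + 2; i < end; i++) {
            int d = Character.digit(text.charAt(i), 16);
            if (d < 0) {
                throw new IllegalArgumentException("not a hex digit: " + text.charAt(i));
            }
            codePoint = codePoint * 16 + d; // at most 6 digits, so no int overflow
        }
        // Assumption: only real code points are allowed, never lone surrogates.
        if (!Character.isValidCodePoint(codePoint)
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a usable code point");
        }
        return codePoint;
    }

    public static void main(String[] args) {
        int cp = decode("\\x1F600;", 0);
        System.out.printf("U+%X%n", cp);
        // A supplementary code point is one 32-bit value for the lexer,
        // but needs two UTF-16 code units (a surrogate pair) in a Java String.
        System.out.println(Character.isSupplementaryCodePoint(cp));
        System.out.println(Character.charCount(cp));
    }
}
```

The explicit ; terminator is what lets the escape carry more than four hex
digits without ambiguity about where the number ends, which the fixed-width
\unnnn form cannot do.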

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Ross Freemantle
> Sent: Friday, April 20, 2012 7:46 AM
> To: ANTLR interest
> Subject: [antlr-interest] Encoding Unicode code points in a grammar
> file
>
> Hi,
>
> I'm trying to use the C target to write a parser for a simple language
> that permits Unicode characters in identifiers and string literals.
>
> The '\uXXXX' escape sequence works fine for characters in the Basic
> Multilingual Plane, but isn't suitable for anything beyond it. Ideally,
> I need a way of encoding the actual code point number into the grammar
> file. This quote from the C FAQ suggests it can be done:
>
> "The purpose of LA() is to return the 32 bit integer Unicode code point
> for the specified character – how it does that is irrelevant to the
> lexer, which is just matching 32 bit numbers. This means you should not
> code your lexer to match surrogates, just the code points."
>
> I haven't been able to find any documentation or code examples to
> support this, however. Is this actually possible, or am I barking up
> the wrong tree?
>
> Thanks in advance,
>
> Ross Freemantle