[antlr-interest] Encoding Unicode code points in a grammar file

Jim Idle jimi at temporal-wave.com
Fri Apr 20 11:00:43 PDT 2012


Did you try just '\uNNNNN' - are you seeing an error when you do this?

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Ross Freemantle
> Sent: Friday, April 20, 2012 7:46 AM
> To: ANTLR interest
> Subject: [antlr-interest] Encoding Unicode code points in a grammar
> file
>
> Hi,
>
> I'm trying to use the C target to write a parser for a simple language
> that permits Unicode characters in identifiers and string literals.
>
> The '\uXXXX' escape sequence works fine for characters in the Basic
> Multilingual Plane, but isn't suitable for anything beyond it. Ideally,
> I need a way of encoding the actual code point number into the grammar
> file. This quote from the C FAQ suggests it can be done:
>
> "The purpose of LA() is to return the 32 bit integer Unicode code point
> for the specified character – how it does that is irrelevant to the
> lexer, which is just matching 32 bit numbers. This means you should not
> code your lexer to match surrogates, just the code points."
>
> I haven't been able to find any documentation or code examples to
> support this, however. Is this actually possible, or am I barking up
> the wrong tree?
>
> Thanks in advance,
>
> Ross Freemantle
>
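
To illustrate the FAQ passage quoted above, here is a minimal sketch in C of how an input stream could turn UTF-16 code units into the 32-bit code points a lexer would then match. It is not the ANTLR C runtime's implementation: the function names (is_high_surrogate, is_low_surrogate, next_code_point) are hypothetical, and the choice to return U+FFFD for an unpaired surrogate is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the ANTLR C runtime. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Decode the code point that starts at buf[*pos] in a UTF-16 buffer of
 * len code units, advancing *pos by one or two units.
 * Assumption for this sketch: an unpaired surrogate yields U+FFFD.
 */
static uint32_t next_code_point(const uint16_t *buf, size_t len, size_t *pos)
{
    assert(*pos < len);                 /* caller must not read past the end */
    uint16_t hi = buf[(*pos)++];

    if (is_high_surrogate(hi) && *pos < len && is_low_surrogate(buf[*pos])) {
        uint16_t lo = buf[(*pos)++];
        /* Combine the two 10-bit halves and add the supplementary-plane offset. */
        return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    }
    if (is_high_surrogate(hi) || is_low_surrogate(hi)) {
        return 0xFFFDu;                 /* unpaired surrogate */
    }
    return hi;                          /* ordinary BMP character */
}
```

With a decoder like this underneath, the lexer only ever sees single values such as 0x1F600, never the pair 0xD83D 0xDE00, which is why the FAQ advises against writing lexer rules that match surrogates.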

