[antlr-interest] ANTLR3 C Target: Unicode characters support.

Thu Apr 12 02:17:45 PDT 2012

Hi Jim,

Thanks for your response. Sorry, I'm not sure what you mean by "surrogate pairs are not specified and the input stream handles them".

The '\unnnn' syntax works fine for characters in the BMP, but what about characters in the supplementary planes, which can't be expressed using 4 hex digits?

Elsewhere on this mailing list, it's been suggested that such characters should be encoded in the grammar file as surrogate pairs. For example, the Unicode code point U+2008A would be expressed as '\uD840\uDC8A'. This appears to work OK when using the Java target, but when I use the C target I get a lexer error. I'm using UTF-8 as the input encoding and passed the corresponding encoding flag when creating the input stream.
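For reference, the arithmetic behind that example can be sketched in C. This is only an illustration and not ANTLR code: the helper names to_surrogate_pair and to_utf8 are made up for this sketch and are not part of the ANTLR3 C runtime. It shows how U+2008A maps to the UTF-16 pair D840 DC8A, and which bytes the same code point would occupy in a UTF-8 input file.

```c
#include <stdint.h>

/* Hypothetical helper (not an ANTLR3 API): split a supplementary-plane
 * code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if cp is outside that range. */
int to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000u || cp > 0x10FFFFu) {
        return -1;
    }
    uint32_t v = cp - 0x10000u;                  /* 20-bit offset */
    *high = (uint16_t)(0xD800u + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (v & 0x3FFu));  /* bottom 10 bits */
    return 0;
}

/* Hypothetical helper (not an ANTLR3 API): encode cp as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count,
 * or 0 if cp is a surrogate or beyond U+10FFFF. */
int to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu)) {
        return 0;
    }
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    out[0] = (unsigned char)(0xF0u | (cp >> 18));
    out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 4;
}
```

Calling to_surrogate_pair(0x2008A, &hi, &lo) gives hi = 0xD840 and lo = 0xDC8A, matching the escape above, while to_utf8(0x2008A, buf) produces the four bytes F0 A0 82 8A. In other words, a UTF-8 input file holds this character as one four-byte sequence and never contains the surrogate values themselves.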

In theory, should this work, or am I doing something obviously wrong? I can provide my test program, if required.

Regards,

Khutsafalo

________________________________
 From: Jim Idle <jimi at temporal-wave.com>
To: antlr-interest at antlr.org 
Sent: Wednesday, 11 April 2012, 16:02
Subject: Re: [antlr-interest] ANTLR3 C Target: Unicode characters support.

There are no limitations, but I think you are confusing the term Unicode
with Encoding. The lexer uses a 32 bit integer code point and can
therefore store and match the entire Unicode code-point range.

However, when you create the input stream, you must specify what the input
stream encoding is (as of version 3.4). This can be Latin-1, UTF-8, UTF-16,
UTF-32 or EBCDIC, and it will auto-detect and compensate for the BOM. See
the code for the input stream for details.

To specify a character, use '\unnnn' where nnnn is the hex code point.
Note that surrogate pairs are not specified and the input stream handles
them (see the relevant versions of the LA() routines in the source code
for details).
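As a rough picture of what "the input stream handles them" could mean for UTF-16 input, the sketch below combines a high/low surrogate pair into one 32-bit code point before it is handed on. It is an assumption-laden illustration, not the runtime's actual LA() code: the function name utf16_next and its signature are invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder (not the ANTLR3 LA() implementation): return the
 * code point that starts at units[i], and report through *consumed how
 * many 16-bit units it occupied (1 or 2). The caller must ensure i < len.
 * An unpaired surrogate is returned unchanged as a single unit. */
uint32_t utf16_next(const uint16_t *units, size_t len, size_t i,
                    size_t *consumed)
{
    uint32_t hi = units[i];

    if (hi >= 0xD800u && hi <= 0xDBFFu && i + 1 < len) {
        uint32_t lo = units[i + 1];
        if (lo >= 0xDC00u && lo <= 0xDFFFu) {
            /* Valid pair: rebuild the 20-bit offset and add 0x10000. */
            *consumed = 2;
            return 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
        }
    }

    *consumed = 1;
    return hi;
}
```

Given the units 0041 D840 DC8A, decoding at index 0 yields 0x41 using one unit, and decoding at index 1 yields 0x2008A using two units, so a consumer of this function only ever sees whole code points.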

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Khutsafalo Jabari
> Sent: Wednesday, April 11, 2012 3:48 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] ANTLR3 C Target: Unicode characters support.
>
> Hi all,
>
> Using ANTLR3 C target, I am writing a parser that accepts Unicode
> characters for both identifiers and string literals. What are ANTLR's
> limitations for characters outside the Unicode basic multilingual plane
> i.e. Unicode supplementary planes? Also, how do I specify these
> characters in the lexer?
>
>
> Thanks in advance.
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address
