[antlr-interest] ANTLR3 C Target: Unicode characters support.

Jim Idle jimi at temporal-wave.com
Wed Apr 11 08:02:19 PDT 2012

There are no limitations, but I think you are conflating Unicode with
encoding. The lexer uses a 32-bit integer for each code point and can
therefore store and match the entire Unicode code-point range.

However, when you create the input stream, you must specify what the input
stream encoding is (as of version 3.4). This can be Latin-1, UTF-8, UTF-16,
UTF-32 or EBCDIC, and the stream will auto-detect and compensate for the BOM.
See the code for the input stream for details.
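
To make the BOM point concrete, here is a minimal sketch of how a byte order
mark can be recognised at the start of a buffer. It is not the runtime's code:
the names bom_encoding and detect_bom are made up for illustration, and the
real logic lives in the input stream source mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not part of the ANTLR3 C runtime. */
typedef enum {
    ENC_UNKNOWN,
    ENC_UTF8,
    ENC_UTF16BE,
    ENC_UTF16LE,
    ENC_UTF32BE,
    ENC_UTF32LE
} bom_encoding;

/* Inspect the first bytes of buf and report which BOM, if any, is present.
 * On return, *bom_len holds the number of bytes to skip (0 when no BOM). */
static bom_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(buf != NULL && bom_len != NULL);
    *bom_len = 0;

    /* UTF-32 must be tested before UTF-16, because the UTF-32LE BOM
     * (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). */
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    return ENC_UNKNOWN; /* no BOM: the caller has to state the encoding */
}
```

A caller would skip bom_len bytes and then decode the rest of the buffer
according to the returned encoding. Latin-1 and EBCDIC have no BOM, so for
those the encoding always has to be stated explicitly.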

To specify a character, use '\unnnn' where nnnn is the hex code point.
Note that you do not specify surrogate pairs yourself; the input stream
handles them (see the relevant versions of the LA() routines in the source
code for details).
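
As a rough illustration of what handling surrogate pairs means for a UTF-16
stream, the sketch below combines a high and a low surrogate into one 32-bit
code point, which is the kind of value the lexer then matches against. The
function name utf16_decode is hypothetical and this is not the runtime's LA()
implementation; passing an unpaired surrogate through unchanged is also just
an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not the ANTLR3 runtime's LA() routine.
 * Decodes one code point from the start of a UTF-16 code unit sequence and
 * stores the number of code units consumed (1 or 2) in *consumed. */
static uint32_t utf16_decode(const uint16_t *units, size_t len, size_t *consumed)
{
    assert(units != NULL && consumed != NULL && len > 0);

    uint32_t hi = units[0];
    *consumed = 1;

    /* A high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF)
     * encodes one code point in the supplementary planes. */
    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 2) {
        uint32_t lo = units[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *consumed = 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }

    /* BMP character, or an unpaired surrogate passed through as-is
     * (an assumption of this sketch). */
    return hi;
}
```

For example, the code units D83D DE00 decode to the single code point
U+1F600, consuming two units, while 0041 decodes to U+0041 and consumes one.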


> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Khutsafalo Jabari
> Sent: Wednesday, April 11, 2012 3:48 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] ANTLR3 C Target: Unicode characters support.
> Hi all,
> Using ANTLR3 C target, I am writing a parser that accepts Unicode
> characters for both identifiers and string literals. What are ANTLR's
> limitations for characters outside the Unicode basic multilingual plane
> i.e. Unicode supplementary planes? Also, how do I specify these
> characters in the lexer?
> Thanks in advance.
