[antlr-interest] Changes to C runtime for 3.4

Fri Jun 24 09:42:11 PDT 2011

Please note that the documentation for the C runtime in 3.4 is yet to be
updated. In the meantime, if you wish to try it, then there is one change
that you need to be aware of:

1)      The distinction between ASCII and UCS2 input streams is now removed
and there is a single function: antlr3FileStreamNew() to replace the file
related input streams and a function” antlr3StringStreamNew to replace the
memory related input streams. Prototypes and usage:

antlr3FileStreamNew(pANTLR3_UINT8 fileName, ANTLR3_UINT32 encoding)

antlr3StringStreamNew(pANTLR3_UINT8 data, ANTLR3_UINT32 encoding,
ANTLR3_UINT32 size, pANTLR3_UINT8 name)

fileName – path to input file in 8 bit characters. Used to call fopen()

data – pointer to input data in any encoded form (note that I will change
this to void * in the next beta/release)

size – the size of the input data (always bytesm regardless of encoding)

name – The name to use for the string stream (passed to error handlers for
instance) may be NULL

Then the encoding values are:

ANTLR3_ENC_8BIT    – 8 bit encoding (ASCII/latin1/etc) (replaces the
existing ASCII stream)

ANTLR3_ENC_UTF8    – UTF8 encoding  (eats any BOM that may be present)

ANTLR3_ENC_UTF16   – UTF16 encoding (also handles UCS2) – determine byte
order from BOM or machine natural without BOM

ANTLR3_ENC_UTF16BE – UTF16 encoding (also handles UCS2), big endian but no
BOM

ANTLR3_ENC_UTF16LE – UTF16 encoding (also handles UCS2), little endian but
no BOM

ANTLR3_ENC_UTF32   - UTF32 encoding – determine byte order from BOM or
machine natural without BOM

ANTLR3_ENC_UTF32BE - UTF32 encoding – big endian but no BOM

ANTLR3_ENC_UTF32LE - UTF32 encoding – little endian but no BOM

ANTLR3_ENC_EBCDIC  - EBCDIC encoding (8 bit).

Note that EBCDIC encoding means that the input is in EBCDIC and it is not
changed. The LA() method for EBCDIC encoding converts a character to ASCII
before matching. Therefore the pointers to the first character of the token
in the input stream remain pointing at EBCDIC and you are responsible for
any conversion of the token strings if you need to convert them.

Encoding is as per the Unicode standards and supports the full Unicode
character range and all surrogate pairs are decoded in UTF16. Note however
that for performance reasons, errors in the encoding are usually ignored
(for instance a valid hi surrogate that does not have a lo surrogate), but
that invalid sequences that may not be ignored, may screw up your input. You
can of course override any of the LA methods and report such things as
errors, should you need to do so. The purpose of LA() is to return the 32
bit integer Unicode code point for the specified character – how it does
that is irrelevant to the lexer, which is just matching 32 but numbers. This
means you should not code your lexer to match surrogates, just the code
points.

Jim