[antlr-interest] Changes to C runtime for 3.4

Wed Jul 20 13:37:56 PDT 2011

Hi Jim,

On 6/24/2011 12:42 PM, Jim Idle wrote:
> Please note that the documentation for the C runtime in 3.4 is yet to be
> updated. In the meantime, if you wish to try it, then there is one change
> that you need to be aware of:
>
>
>
> 1)      The distinction between ASCII and UCS2 input streams is now removed
> and there is a single function, antlr3FileStreamNew(), to replace the file
> related input streams and a function, antlr3StringStreamNew(), to replace the
> memory related input streams. Prototypes and usage:
>
>
>
>
>
> antlr3FileStreamNew(pANTLR3_UINT8 fileName, ANTLR3_UINT32 encoding)
>
> antlr3StringStreamNew(pANTLR3_UINT8 data, ANTLR3_UINT32 encoding,
> ANTLR3_UINT32 size, pANTLR3_UINT8 name)
>
>
>
> fileName – path to input file in 8 bit characters. Used to call fopen()
>
> data – pointer to input data in any encoded form (note that I will change
> this to void * in the next beta/release)
>
> size – the size of the input data (always in bytes, regardless of encoding)
>
> name – The name to use for the string stream (passed to error handlers,
> for instance); may be NULL
The name argument does not appear to accept NULL: when I passed NULL, the
call promptly crashed with an access violation in strlen(), called from
newStr8(). Passing any non-NULL string works, of course. I have no use for
the name, so I'd like to pass NULL. Is this a bug, or should I just be
passing an empty string instead? I'm using ANTLR3_ENC_8BIT, if that matters.
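
In the meantime, a minimal workaround sketch, assuming the crash really is
strlen() being handed a NULL name. The helper name stream_name_or_empty is
made up for illustration and is not part of the runtime; the call in the
trailing comment follows the prototype quoted above and is untested.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the ANTLR C runtime): substitute an
 * empty string when the caller has no name to give, so that a strlen()
 * further down never receives a NULL pointer. */
static const char *stream_name_or_empty(const char *name)
{
    return (name != NULL) ? name : "";
}

/* Intended use, following the prototype quoted above (untested sketch):
 *
 *   input = antlr3StringStreamNew(data, ANTLR3_ENC_8BIT, size,
 *               (pANTLR3_UINT8)stream_name_or_empty(NULL));
 */
```

This only sidesteps the crash on the caller's side; whether NULL is meant to
be accepted is still the open question.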

Thanks,

- Justin

>
>
> Then the encoding values are:
>
>
>
> ANTLR3_ENC_8BIT    – 8 bit encoding (ASCII/latin1/etc) (replaces the
> existing ASCII stream)
>
> ANTLR3_ENC_UTF8    – UTF8 encoding  (eats any BOM that may be present)
>
> ANTLR3_ENC_UTF16   – UTF16 encoding (also handles UCS2) – determine byte
> order from BOM or machine natural without BOM
>
> ANTLR3_ENC_UTF16BE – UTF16 encoding (also handles UCS2), big endian but no
> BOM
>
> ANTLR3_ENC_UTF16LE – UTF16 encoding (also handles UCS2), little endian but
> no BOM
>
> ANTLR3_ENC_UTF32   - UTF32 encoding – determine byte order from BOM or
> machine natural without BOM
>
> ANTLR3_ENC_UTF32BE - UTF32 encoding – big endian but no BOM
>
> ANTLR3_ENC_UTF32LE - UTF32 encoding – little endian but no BOM
>
> ANTLR3_ENC_EBCDIC  - EBCDIC encoding (8 bit).
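
For readers unfamiliar with the "byte order from BOM or machine natural"
rule, here is a rough sketch of the general technique for UTF16. It is not
the runtime's code; the type and function names (byte_order,
utf16_order_from_bom, host_byte_order) are assumptions made for this example.

```c
#include <stddef.h>

typedef enum { ORDER_UNKNOWN, ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN } byte_order;

/* Inspect the first two bytes of a UTF-16 buffer for a byte order mark.
 * FE FF means big endian, FF FE means little endian. Anything else
 * (or a buffer shorter than two bytes) yields ORDER_UNKNOWN. */
static byte_order utf16_order_from_bom(const unsigned char *data, size_t size)
{
    if (data == NULL || size < 2) {
        return ORDER_UNKNOWN;
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return ORDER_BIG_ENDIAN;
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return ORDER_LITTLE_ENDIAN;
    }
    return ORDER_UNKNOWN;
}

/* "Machine natural" order: look at how this host stores the value 1. */
static byte_order host_byte_order(void)
{
    const unsigned short probe = 1;
    return (*(const unsigned char *)&probe == 1) ? ORDER_LITTLE_ENDIAN
                                                 : ORDER_BIG_ENDIAN;
}
```

A caller would fall back to host_byte_order() whenever
utf16_order_from_bom() returns ORDER_UNKNOWN. Note that FF FE is also the
start of a UTF32 little endian BOM (FF FE 00 00), so this check is only
meaningful once the input is known to be UTF16.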
>
>
>
> Note that EBCDIC encoding means that the input is in EBCDIC and it is not
> changed. The LA() method for EBCDIC encoding converts a character to ASCII
> before matching. Therefore the pointers to the first character of the token
> in the input stream remain pointing at EBCDIC and you are responsible for
> any conversion of the token strings if you need to convert them.
>
>
>
> Encoding is as per the Unicode standards and supports the full Unicode
> character range; all surrogate pairs are decoded in UTF16. Note however
> that, for performance reasons, errors in the encoding are usually ignored
> (for instance a valid hi surrogate that does not have a lo surrogate), but
> invalid sequences that cannot be ignored may screw up your input. You
> can of course override any of the LA methods and report such things as
> errors, should you need to do so. The purpose of LA() is to return the 32
> bit integer Unicode code point for the specified character; how it does
> that is irrelevant to the lexer, which is just matching 32 bit numbers. This
> means you should not code your lexer to match surrogates, just the code
> points.
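
To make the surrogate point concrete, here is a small sketch of how two
UTF16 code units combine into one 32 bit code point. It illustrates the
standard arithmetic only and is not the runtime's LA() implementation; the
function name next_code_point and its consumed parameter are assumptions.
An unpaired hi surrogate is passed through unchanged, which is one possible
reading of "errors are usually ignored".

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from native-order UTF-16 code units.
 * *consumed receives the number of units used (0, 1 or 2).
 * Returns 0xFFFFFFFF when there is no input left. */
static uint32_t next_code_point(const uint16_t *units, size_t count,
                                size_t *consumed)
{
    uint32_t hi;

    if (count == 0) {
        *consumed = 0;
        return 0xFFFFFFFFu;
    }

    hi = units[0];

    /* A hi surrogate followed by a lo surrogate forms a single code
     * point in the range U+10000..U+10FFFF. */
    if (hi >= 0xD800u && hi <= 0xDBFFu && count >= 2) {
        uint32_t lo = units[1];
        if (lo >= 0xDC00u && lo <= 0xDFFFu) {
            *consumed = 2;
            return 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
        }
    }

    /* Ordinary BMP unit, or an unpaired surrogate passed through as is. */
    *consumed = 1;
    return hi;
}
```

For example, the units 0xD83D 0xDE00 decode to the single value 0x1F600, and
that single value is what a lexer rule would match against, never the two
surrogates individually.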
>
>
>
> Jim