[antlr-interest] ANTLR C: Question regarding the portability of generated lexer C code

Sat Oct 17 20:42:05 PDT 2009

However, I am pretty sure that all the C compilers on such systems let you specify ASCII assumptions for character literals rather than the stupid EBCDIC (designed by committee to be stupid). For instance, I know that the z/OS compiler allows this. EBCDIC is a ridiculous encoding, which I won’t be supporting directly, I am afraid.

So, compile the code with ASCII assumptions, convert the EBCDIC input to 8-bit Unicode before feeding it to the lexer, and you will be fine.
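One way to make the "ASCII assumptions" requirement fail loudly instead of silently is a compile-time guard placed in a header that the generated lexer includes. The following is only a sketch: the names lexer_literals_must_be_ascii and literals_are_ascii are hypothetical and are not part of the ANTLR C runtime, and the compiler option that switches literals to ASCII is not named in this thread, so check your compiler's documentation for it.

```c
/*
 * Hypothetical guard, not part of the ANTLR C runtime.
 * If character literals are not compiled as ASCII, the array size
 * becomes -1 and the translation unit fails to compile.
 */
typedef char lexer_literals_must_be_ascii[
    ('A' == 0x41 && 'Z' == 0x5A &&
     'a' == 0x61 && 'z' == 0x7A &&
     '0' == 0x30 && '9' == 0x39) ? 1 : -1];

/* Runtime version of the same check, useful in a start-up self-test. */
static int literals_are_ascii(void)
{
    return 'A' == 0x41 && 'Z' == 0x5A &&
           'a' == 0x61 && 'z' == 0x7A &&
           '0' == 0x30 && '9' == 0x39;
}
```

The check is written as a typedef rather than a preprocessor #if because the C standard leaves it implementation-defined whether character constants in #if have the same values as in ordinary expressions.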

Jim

From: Lego Haryanto [mailto:legoharyanto at gmail.com] 
Sent: Thursday, October 15, 2009 8:27 PM
To: Jim Idle
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] ANTLR C: Question regarding the portability of generated lexer C code

Jim, thanks for your response ...

I know that on the EBCDIC system we feed a Unicode stream into the lexer, so I'm pretty sure that when the generated lexer code I pasted before is executed, it is already operating on the 32-bit Unicode stream.

The problem is more about native C compilation on an EBCDIC system such as an IBM z/OS mainframe.

To see if a character from the Unicode stream is an 'A', we have to compare it with the value 0x0041. If we instead compare it with a native 'A' in the code, it will not match under an EBCDIC C compilation.

Best,
-Lego

On Fri, Oct 16, 2009 at 3:07 AM, Jim Idle <jimi at temporal-wave.com> wrote:

ANTLR works internally with 32-bit Unicode (UTF-32), not EBCDIC, even if it is in 8-bit mode. So you need to convert the EBCDIC to 8-bit Unicode and use the ‘ASCII’ input stream. A simple way to do this would be to write your own EBCDIC input stream that just converts to Unicode code points (essentially EBCDIC->ASCII) on the fly via a lookup table. Trivial and should be pretty quick.
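A minimal sketch of the lookup-table idea follows. It is not ANTLR runtime code: the function and table names are hypothetical, the table covers only space, letters and digits (any other byte is mapped to '?'), and it converts a buffer in place before the buffer is handed to the 8-bit input stream rather than implementing a custom input stream. All values are written as hex so the code behaves the same whether it is compiled with ASCII or EBCDIC literals.

```c
#include <stddef.h>
#include <stdint.h>

#define UNMAPPED 0x3Fu              /* ASCII '?', used for bytes not in the table */

static uint8_t ebcdic_to_ascii[256];

/* Map a contiguous run of EBCDIC codes onto a contiguous run of ASCII codes. */
static void fill_run(unsigned ebcdic_first, unsigned ascii_first, unsigned count)
{
    unsigned i;
    for (i = 0; i < count; i++)
        ebcdic_to_ascii[ebcdic_first + i] = (uint8_t)(ascii_first + i);
}

/* Build the (partial) table; call once before converting anything. */
static void init_ebcdic_table(void)
{
    unsigned i;
    for (i = 0; i < 256; i++)
        ebcdic_to_ascii[i] = UNMAPPED;

    ebcdic_to_ascii[0x40] = 0x20;   /* space */
    fill_run(0xC1, 0x41, 9);        /* A-I */
    fill_run(0xD1, 0x4A, 9);        /* J-R */
    fill_run(0xE2, 0x53, 8);        /* S-Z */
    fill_run(0x81, 0x61, 9);        /* a-i */
    fill_run(0x91, 0x6A, 9);        /* j-r */
    fill_run(0xA2, 0x73, 8);        /* s-z */
    fill_run(0xF0, 0x30, 10);       /* 0-9 */
}

/* Convert a buffer of EBCDIC bytes to ASCII in place. */
static void ebcdic_buffer_to_ascii(uint8_t *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = ebcdic_to_ascii[buf[i]];
}
```

A real table would need all 256 entries filled in for the specific EBCDIC code page in use, since punctuation positions differ between code pages.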

Jim

From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Lego Haryanto
Sent: Tuesday, October 13, 2009 3:51 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] ANTLR C: Question regarding the portability of generated lexer C code

I just recently noticed that the generated code from my lexer grammar contains something like the following snippet:

            .
            .
            else if ( (((LA17_0 >= 'A') && (LA17_0 <= 'Z'))) ) 
            {
                alt17=2;
            }
            else if ( (((LA17_0 >= 'a') && (LA17_0 <= 'z'))) ) 
            {
                alt17=3;
            }
            else if ( (((LA17_0 >= 0x00A0) && (LA17_0 <= 0xD7FF))) ) 
            {
                alt17=4;
            }
            .
            .

The generated code seems to comfortably use 'A' ... 'Z' literals.  This may not be good if, say, I compile the generated code in an IBM z/OS EBCDIC environment, as the ['A' .. 'Z'] range there contains more than just the 26 letter codes, and the values of the codes are not the same as those in the Unicode character set.

I was expecting something like the third expression, with 'A' written explicitly as 0x0041 (the Unicode code point for 'A').
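To make the concern concrete, here is a small sketch (hypothetical function names, not generated ANTLR code) contrasting a range test written with explicit code points against what the literal-based test turns into when 'A' and 'Z' are compiled as their EBCDIC values, 0xC1 and 0xE9. The input is assumed to be a Unicode code point, as it would be coming from the lexer's input stream.

```c
#include <stdint.h>

/* Independent of the compiler's character set: explicit Unicode code points. */
static int is_upper_by_codepoint(uint32_t c)
{
    return c >= 0x0041u && c <= 0x005Au;
}

/*
 * What (c >= 'A' && c <= 'Z') effectively becomes when the literals are
 * compiled as EBCDIC: 'A' is 0xC1 and 'Z' is 0xE9.
 */
static int is_upper_as_compiled_for_ebcdic(uint32_t c)
{
    return c >= 0xC1u && c <= 0xE9u;
}
```

With a Unicode input stream, the second function rejects U+0041 ('A') and accepts code points such as U+00D0, so the test is wrong in both directions.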

Please confirm.

-Lego

-- 
Fear of the LORD is the beginning of knowledge (Proverbs 1:7)
