[antlr-interest] ANTLR C: Question regarding the portability of generated lexer C code

Thu Oct 15 07:56:44 PDT 2009

Jim, thanks for your response ...

I know that on the EBCDIC system we feed a Unicode stream into the lexer,
so I'm fairly sure that by the time the generated lexer code I pasted earlier
runs, it is already operating on a 32-bit Unicode stream.
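
For reference, turning EBCDIC bytes into code points before the lexer sees
them can be as simple as a table lookup. This is only a sketch: the names
ebcdic_to_ucs, init_ebcdic_table and ebcdic_decode are made up for
illustration and are not part of the ANTLR C runtime, and the table covers
only space, digits and unaccented letters (positions shared by EBCDIC code
pages 037 and 1047). A real table would need all 256 entries for the code
page actually in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table: EBCDIC byte -> Unicode code point.
 * Unmapped bytes become U+FFFD (replacement character). */
static uint32_t ebcdic_to_ucs[256];

/* Map `count` consecutive EBCDIC bytes to consecutive code points. */
static void fill_run(unsigned char first_ebcdic, uint32_t first_ucs, int count)
{
    int i;
    assert(first_ebcdic + count <= 256);
    for (i = 0; i < count; i++)
        ebcdic_to_ucs[first_ebcdic + i] = first_ucs + (uint32_t)i;
}

static void init_ebcdic_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        ebcdic_to_ucs[i] = 0xFFFD;
    ebcdic_to_ucs[0x40] = 0x0020;   /* space */
    fill_run(0x81, 0x0061, 9);      /* a-i */
    fill_run(0x91, 0x006A, 9);      /* j-r */
    fill_run(0xA2, 0x0073, 8);      /* s-z */
    fill_run(0xC1, 0x0041, 9);      /* A-I */
    fill_run(0xD1, 0x004A, 9);      /* J-R */
    fill_run(0xE2, 0x0053, 8);      /* S-Z */
    fill_run(0xF0, 0x0030, 10);     /* 0-9 */
}

/* Convert n EBCDIC bytes into n 32-bit code points. */
static void ebcdic_decode(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = ebcdic_to_ucs[in[i]];
}
```

Note that the table is written entirely with numeric constants, so it means
the same thing whichever character set the compiler uses.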

The problem is more about native C compilation on an EBCDIC system such as
an IBM z/OS mainframe.

To see whether a character from the Unicode stream is an 'A', we have to
compare it with the value 0x0041. If the generated code compares it with a
native 'A' literal instead, the comparison will not match when compiled on an
EBCDIC system, where 'A' has the value 0xC1.
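
To make the mismatch concrete, here is a small sketch. The function names are
made up for illustration and nothing here is ANTLR output. The second function
hard-codes 0xC1 and 0xE9, the EBCDIC values of 'A' and 'Z', to imitate what
(c >= 'A' && c <= 'Z') becomes after compilation on an EBCDIC system.

```c
#include <assert.h>
#include <stdint.h>

/* Code points written as numbers: same meaning on any compiler. */
#define UCS_UPPER_A 0x0041u
#define UCS_UPPER_Z 0x005Au

/* Portable test for A-Z on a stream of Unicode code points. */
static int is_ucs_upper(uint32_t c)
{
    return c >= UCS_UPPER_A && c <= UCS_UPPER_Z;
}

/* Imitates (c >= 'A' && c <= 'Z') as compiled with an EBCDIC
 * execution character set, where 'A' is 0xC1 and 'Z' is 0xE9. */
static int is_upper_as_compiled_on_ebcdic(uint32_t c)
{
    return c >= 0xC1u && c <= 0xE9u;
}

/* Hypothetical self-check showing where the two tests disagree. */
static void self_check(void)
{
    assert(is_ucs_upper(0x0041));                    /* U+0041 'A' accepted */
    assert(!is_upper_as_compiled_on_ebcdic(0x0041)); /* U+0041 'A' rejected */
    assert(is_upper_as_compiled_on_ebcdic(0x00C1));  /* U+00C1 accepted instead */
}
```

On a Unicode stream the EBCDIC-compiled test accepts U+00C1 through U+00E9 and
rejects every real A-Z, which is why the generated code would need numeric
constants to be portable.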

Best,
-Lego

On Fri, Oct 16, 2009 at 3:07 AM, Jim Idle <jimi at temporal-wave.com> wrote:

>  ANTLR works internally with 32 bit Unicode (UTF32), not EBCDIC, even if
> it is in 8 bit mode. So you need to convert the EBCDIC to Unicode 8 bits and
> use the ‘ASCII’ input stream. A simple way to do this would be to write your
> own EBCDIC input stream that just converted to Unicode code points
> (essentially EBCDIC->ASCII) on the fly via a lookup table. Trivial and
> should be pretty quick.
>
> Jim
>
> *From:* antlr-interest-bounces at antlr.org [mailto:
> antlr-interest-bounces at antlr.org] *On Behalf Of *Lego Haryanto
> *Sent:* Tuesday, October 13, 2009 3:51 AM
> *To:* antlr-interest at antlr.org
> *Subject:* [antlr-interest] ANTLR C: Question regarding the portability of
> generated lexer C code
>
> I just recently noticed that the generated code from my lexer grammar
> contains something like the following snippet:
>
>             .
>             .
>             else if ( (((LA17_0 >= 'A') && (LA17_0 <= 'Z'))) )
>             {
>                 alt17=2;
>             }
>             else if ( (((LA17_0 >= 'a') && (LA17_0 <= 'z'))) )
>             {
>                 alt17=3;
>             }
>             else if ( (((LA17_0 >= 0x00A0) && (LA17_0 <= 0xD7FF))) )
>             {
>                 alt17=4;
>             }
>             .
>             .
>
> The generated code seems to comfortably use 'A' ... 'Z' literals.  This may
> not be good if, say, I compile the generated code in an IBM z/OS EBCDIC
> environment, because there the ['A' .. 'Z'] range contains more than just
> the 26 letter codes, and the values of those codes are not the same as the
> ones in the Unicode character set.
>
> I was expecting something like the third expression, with 'A' written
> explicitly as 0x0041 (the Unicode code point for 'A').
>
> Please confirm.
>
> -Lego
>

-- 
Fear of the LORD is the beginning of knowledge (Proverbs 1:7)