[antlr-interest] antlr-interest Digest, Vol 38, Issue 52

J Chapman Flack jflack at math.purdue.edu
Wed Jan 16 12:39:41 PST 2008


> From: Cameron Ross <cross at symboticware.com>
> Digit = [ "0" ? "9" ].
> ByteSequence = "#" Digit+ "\"" <byte sequence>.
> 
> Where # signifies the beginning of a byte sequence header, Digit+ 
> signifies the number of bytes to follow in the byte sequence, and " 
> signifies the end of the header and the beginning of the actual byte 
> sequence data. Note that bytes in the sequence can fall anywhere within 
> the extended ASCII character set (i.e. from 0x00 to 0xFF).
> ... However, when bytes
> in the 8-bit ASCII range are scanned (0x80 to 0xFF), the integer value 
> returned by LA(1) is always incorrectly reported as 65533 (0xFFFD). I 

"extended ASCII" and "8-bit ASCII" are terms that aren't based in
any standard and so they mostly get in the way of seeing what's going
on here.  ASCII is a 7-bit code that uses the values 0x00 - 0x7F,
exclusively.

I don't see your code for providing the input to the ANTLR lexer,
but it seems likely that it's coming as a Java character stream
(i.e. there is a java.io.Reader involved).  The Reader's job is
to convert external, byte-oriented character representations into
Unicode characters. ANTLR folks can easily picture a Reader
as a kind of pre-lexer that consumes bytes and emits chars.
Its "lexer grammar" is determined by a character set name,
which you can pass to an InputStreamReader constructor, for
example.
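
Here is that "pre-lexer" in miniature (the bytes are made up; the
point is just that the charset name is the knob you turn):

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class PreLexer {
        public static void main(String[] args) throws Exception {
            byte[] raw = { 0x23, 0x31 };   // the bytes for "#1"
            // The charset name passed here is, in effect, the pre-lexer's
            // "grammar": it decides how raw bytes become Java chars.
            Reader pre = new InputStreamReader(
                new ByteArrayInputStream(raw), "US-ASCII");
            System.out.println((char) pre.read());   // #
            System.out.println((char) pre.read());   // 1
        }
    }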

\uFFFD, the Unicode "REPLACEMENT CHARACTER" (I'm not shouting;
official Unicode character names are always in caps), is exactly
what a Reader is expected to emit when it hits a byte sequence that
violates the lexical grammar of its character set.  If your Reader
expects ASCII input, the results you're seeing are exactly what
you would get.
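
You can watch that happen with a few lines of Java: decode a byte
outside 0x00 - 0x7F with a US-ASCII Reader and you get back exactly
the 65533 you reported.

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class AsciiReplacementDemo {
        public static void main(String[] args) throws Exception {
            byte[] raw = { 0x41, (byte) 0x80, (byte) 0xFF };   // 'A' plus two non-ASCII bytes
            Reader r = new InputStreamReader(
                new ByteArrayInputStream(raw), "US-ASCII");
            int c;
            while ((c = r.read()) != -1) {
                System.out.printf("0x%04X (%d)%n", c, c);
            }
            // Prints 0x0041 (65), then 0xFFFD (65533) twice: the decoder
            // substitutes REPLACEMENT CHARACTER for every byte its
            // character set can't account for.
        }
    }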

To get the behavior that you seem to be looking for, you could
create a Reader for the character set "ISO-8859-1".  This character
set just happens to consist of exactly the one-byte sequences
0x00 - 0xff, which it maps directly into the chars \u0000 - \u00ff.
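
The same few lines, decoding as ISO-8859-1 instead, show that
pass-through behavior: every byte value 0xNN comes back as the
char \u00NN.

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class Latin1Demo {
        public static void main(String[] args) throws Exception {
            byte[] raw = { 0x41, (byte) 0x80, (byte) 0xFF };
            Reader r = new InputStreamReader(
                new ByteArrayInputStream(raw), "ISO-8859-1");
            int c;
            while ((c = r.read()) != -1) {
                System.out.printf("0x%04X%n", c);   // 0x0041, 0x0080, 0x00FF
            }
        }
    }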

Some time way back in ANTLR 2 days I remember thinking how nice
it would be if you could explicitly define a byte-oriented
rather than char-oriented lexer (i.e. ANTLR could have both
CharLexer and ByteLexer classes derived from a common
base recognizer class that did almost all the work, but you
would ask it to generate a byte lexer when your language
specification wasn't Unicode-based, and completely avoid
jumping through Unicode-related hoops).  I haven't yet caught
up enough with ANTLR 3 to see whether that's possible, but if
not, the best bet is just to treat your input explicitly as
ISO-8859-1-encoded characters.
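
On the ANTLR 3 Java runtime, wiring that up looks roughly like the
sketch below.  MyLexer stands in for whatever lexer your grammar
generates, the file name is invented, and I'm going from memory of
the runtime API, so treat it as a starting point rather than gospel.

    import org.antlr.runtime.ANTLRReaderStream;
    import org.antlr.runtime.CharStream;
    import org.antlr.runtime.Token;

    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class RunLexer {
        public static void main(String[] args) throws Exception {
            // Decode the raw bytes as ISO-8859-1 so every byte 0x00-0xFF
            // survives as the char \u0000-\u00FF and reaches the lexer intact.
            CharStream input = new ANTLRReaderStream(
                new InputStreamReader(new FileInputStream("data.bin"), "ISO-8859-1"));
            MyLexer lexer = new MyLexer(input);   // MyLexer: your generated lexer
            Token t;
            while ((t = lexer.nextToken()).getType() != Token.EOF) {
                System.out.println(t);
            }
        }
    }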

-Chap

