[antlr-interest] antlr-interest Digest, Vol 38, Issue 52

Wed Jan 16 12:53:31 PST 2008

Works like a charm... thanks!

Cameron.

J Chapman Flack wrote:
>
>> From: Cameron Ross <cross at symboticware.com>
>> Digit = [ "0" .. "9" ].
>> ByteSequence = "#" Digit+ "\"" <byte sequence>.
>>
>> Where # signifies the beginning of a byte sequence header, Digit+ 
>> signifies the number of bytes to follow in the byte sequence, and " 
>> signifies the end of the header and the beginning of the actual byte 
>> sequence data. Note that bytes in the sequence can fall anywhere 
>> within the extended ASCII character set (i.e. from 0x00 to 0xFF).
>> ... However, when bytes
>> in the 8-bit ASCII range are scanned (0x80 to 0xFF), the integer 
>> value returned by LA(1) is always incorrectly reported as 65533 
>> (0xFFFD). I 
>
> "extended ASCII" and "8-bit ASCII" are terms that aren't based in
> any standard and so they mostly get in the way of seeing what's going
> on here.  ASCII is a 7-bit code that uses the values 0x00 - 0x7F,
> exclusively.
>
> I don't see your code for providing the input to the ANTLR lexer,
> but it seems likely that it's coming as a Java character stream
> (i.e. there is a java.io.Reader involved).  The Reader's job is
> to convert external, byte-oriented character representations into
> Unicode characters. ANTLR folks can easily picture a Reader
> as a kind of pre-lexer that consumes bytes and emits chars.
> Its "lexer grammar" is determined by a character set name,
> which you can pass to an InputStreamReader constructor, for
> example.
>
> \uFFFD, the Unicode "REPLACEMENT CHARACTER" (I'm not shouting,
> Unicode official char names are always in caps) is exactly what
> a Reader is expected to emit if it hits a byte sequence that
> violates the lexical grammar of its character set.  If you had
> a Reader that expected ASCII input, the results you're seeing
> are what you would get.
>
> To get the behavior that you seem to be looking for, you could
> create a Reader for the character set "ISO-8859-1". This character
> set just happens to consist of exactly the one-byte sequences
> 0x00 - 0xff which it maps directly into the chars \u0000 - \u00ff.
>
> Some time way back in ANTLR 2 days I remember thinking how nice
> it would be if you could explicitly define a byte-oriented
> rather than char-oriented lexer (i.e. ANTLR could have both
> CharLexer and ByteLexer classes derived from a common
> base recognizer class that did almost all the work, but you
> would ask it to generate a byte lexer when your language
> specification wasn't Unicode-based, and completely avoid
> jumping through Unicode-related hoops).  I haven't yet caught
> up enough with ANTLR 3 to see whether that's possible, but if
> not the best bet is just to treat your input explicitly as
> ISO-8859-1-encoded characters.
>
> -Chap
>
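
For the archive, here is a minimal sketch of the Reader behavior Chap
describes. It is not the code from this thread: the class name
ByteDecodingDemo and the helper decode() are made up for illustration,
and it assumes Java 7 or later (StandardCharsets, try-with-resources).
It pushes the same three bytes through an InputStreamReader twice,
once as US-ASCII and once as ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of ANTLR or of the original poster's code.
final class ByteDecodingDemo {

    // Decode raw bytes through a Reader and return the char values it emits.
    static int[] decode(byte[] raw, Charset charset) {
        StringBuilder chars = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                chars.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return chars.chars().toArray();
    }

    public static void main(String[] args) {
        byte[] raw = { 0x41, (byte) 0x80, (byte) 0xFF };

        // Bytes above 0x7F are not valid ASCII, so the Reader substitutes U+FFFD.
        System.out.println(Arrays.toString(decode(raw, StandardCharsets.US_ASCII)));
        // prints [65, 65533, 65533]

        // ISO-8859-1 maps every byte 0x00-0xFF to the char with the same value.
        System.out.println(Arrays.toString(decode(raw, StandardCharsets.ISO_8859_1)));
        // prints [65, 128, 255]
    }
}
```

A Reader built this way with ISO-8859-1 could then be handed to
whatever character stream the lexer consumes, so that LA(1) reports
the original byte values. How the stream is wired into ANTLR depends
on the ANTLR version and is not shown here.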