[antlr-interest] Scanning extended ASCII characters in ANTLR v3

Cameron Ross cross at symboticware.com
Wed Jan 16 09:03:19 PST 2008


Hello,

I'm constructing a parser for a language that allows arbitrary-length 
byte sequences to be embedded within a well-formed text message. The 
relevant lexical rules from the language specification document are:

Digit = [ "0" – "9" ].
ByteSequence = "#" Digit+ "\"" <byte sequence>.

Here, # marks the beginning of a byte sequence header, Digit+ gives 
the number of bytes that follow, and " marks the end of the header 
and the beginning of the byte sequence data itself. For example, 
#3"ABC embeds the three bytes 0x41 0x42 0x43. Note that bytes in the 
sequence can fall anywhere within the extended ASCII character set 
(i.e. 0x00 to 0xFF). I've defined an ANTLR grammar that works as 
expected as long as the byte sequence stays in the 7-bit ASCII range 
(0x00 to 0x7F). However, when bytes in the 8-bit range (0x80 to 0xFF) 
are scanned, the integer value returned by LA(1) is always 
incorrectly reported as 65533 (0xFFFD, the Unicode replacement 
character). I recall that ANTLR v2 had a charVocabulary option that 
let one set the valid input character set with something like 
charVocabulary = '\0'..'\377' (octal), but this doesn't seem to be 
supported in ANTLR v3.
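My guess is that the CharStream is decoding the input through a 
Reader using the platform default charset, so raw bytes above 0x7F 
that aren't valid in that charset come back as U+FFFD. Would forcing 
a single-byte encoding when the stream is constructed be the right 
workaround? Something like this sketch (the file name is just a 
placeholder):

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CharStream;

// ISO-8859-1 maps every byte 0x00-0xFF one-to-one to the Unicode
// code point with the same value, so LA(1) would see the raw byte.
CharStream input = new ANTLRFileStream("message.dat", "ISO-8859-1");

I haven't verified that this is the intended approach in v3, though.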

How can I get my lexer to accept characters in the 0x80 to 0xFF range? 
The relevant parts of my ANTLR grammar follow:

DIGIT : '0'..'9' ;

BYTE_SEQUENCE
    : '#' DIGIT+ '"'
      {
          // getText() is the text matched so far for this token,
          // e.g. "#12\"". Strip the leading '#' and the trailing '"'
          // to recover the byte count.
          String header = getText();
          int numBytes = Integer.parseInt(header.substring(1, header.length() - 1));
          System.out.println("number of bytes = " + numBytes);
          readBytes(numBytes);
      }
    ;

@lexer::members {
    // Consume numBytes raw characters straight off the input stream,
    // bypassing the normal token rules.
    private void readBytes(int numBytes) {
        for (int i = 0; i < numBytes; ++i) {
            int value = input.LA(1);
            input.consume();
            System.out.println("\tvalue = " + value);
        }
    }
}
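
In case it's useful, here's roughly how I exercise the lexer in a 
quick test (MyLexer stands in for my generated lexer class; a real 
run would use the file stream with an explicit encoding, as above, 
so that 8-bit bytes survive):

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.Token;

public class LexerTest {
    public static void main(String[] args) throws Exception {
        // "#3\"ABC" declares a 3-byte sequence containing 0x41 0x42 0x43.
        MyLexer lexer = new MyLexer(new ANTLRStringStream("#3\"ABC"));
        Token t;
        while ((t = lexer.nextToken()).getType() != Token.EOF) {
            System.out.println(t);
        }
    }
}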

Thanks much,
Cameron.

