[antlr-interest] Scanning extended ASCII characters in ANTLR v3
Cameron Ross
cross at symboticware.com
Wed Jan 16 09:03:19 PST 2008
Hello,
I'm constructing a parser for a language that allows arbitrary-length
byte sequences to be embedded within a well-formed text message. The
relevant lexical rules defined within the language specification
document are:
Digit = [ "0" – "9" ].
ByteSequence = "#" Digit+ "\"" <byte sequence>.
Here, # signifies the beginning of a byte sequence header, Digit+
signifies the number of bytes to follow in the byte sequence, and "
signifies the end of the header and the beginning of the actual byte
sequence data. Note that bytes in the sequence can fall anywhere within
the extended ASCII character set (i.e. from 0x00 to 0xFF). I've defined
an ANTLR grammar that works as expected as long as the byte sequence
stays in the 7-bit ASCII range (0x00 to 0x7F). However, when bytes in
the extended range (0x80 to 0xFF) are scanned, the integer value
returned by LA(1) is always incorrectly reported as 65533 (0xFFFD, the
Unicode replacement character). I recall that ANTLR v2 had a
charVocabulary option where one could set the valid input character set
using something like charVocabulary = '\0'..'\377' (octal), but this
doesn't seem to be supported in ANTLR v3. How can I get my lexer to
accept characters in the 0x80 to 0xFF range?
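Since 0xFFFD is the replacement character, my suspicion is that the
stream class decodes the raw bytes through a Unicode charset and
substitutes U+FFFD for anything it can't map. One workaround I'm
considering, sketched below for the Java target, is to open the input
as ISO-8859-1, which maps every byte 0x00-0xFF one-to-one onto the code
point of the same value. (MessageLexer here is a placeholder name for
the lexer class ANTLR generates from my grammar.)

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.Token;

public class LatinOneDriver {
    public static void main(String[] args) throws Exception {
        // ISO-8859-1 maps each byte 0x00-0xFF directly onto the code
        // point of the same value, so the decoder never has to
        // substitute U+FFFD for an unmappable byte.
        ANTLRFileStream input = new ANTLRFileStream(args[0], "ISO-8859-1");
        MessageLexer lexer = new MessageLexer(input);  // generated lexer
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF;
             t = lexer.nextToken()) {
            System.out.println(t);
        }
    }
}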
The relevant parts of my ANTLR grammar follow:
DIGIT : '0'..'9' ;

BYTE_SEQUENCE
    : '#' DIGIT+ '"'
      {
        // getText() returns the text matched so far for this token,
        // e.g. "#12\"". Strip the leading '#' and the trailing '"' to
        // recover the byte count; there is no INTEGER token in this
        // grammar for $INTEGER.text to refer to.
        String header = getText();
        int numBytes =
            Integer.parseInt(header.substring(1, header.length() - 1));
        System.out.println("number of bytes = " + numBytes);
        readBytes(numBytes);
      }
    ;
@lexer::members {
    // Pull numBytes characters of raw payload straight off the
    // character stream, bypassing normal token matching.
    private void readBytes(int numBytes) {
        for (int i = 0; i < numBytes; ++i) {
            int value = input.LA(1);  // code point of the next character
            input.consume();
            System.out.println("\tvalue = " + value);
        }
    }
}
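To isolate the encoding question from the grammar itself, I've also
sketched a test that uses an in-memory stream; ANTLRStringStream feeds
chars to the lexer directly, so no byte-to-char decoding is involved.
(Again, MessageLexer stands in for the generated lexer class.)

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.Token;

public class ByteSequenceTest {
    public static void main(String[] args) {
        // Header "#3\"" followed by three payload bytes:
        // 0x80, 0xFF, 0x41.
        String message = "#3\"\u0080\u00FF\u0041";
        MessageLexer lexer =
            new MessageLexer(new ANTLRStringStream(message));
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF;
             t = lexer.nextToken()) {
            System.out.println(t);
        }
    }
}

Fed this way, LA(1) inside readBytes() should see 128, 255, and 65,
which is why I suspect the problem lies in how the file's bytes are
decoded into characters rather than in the grammar itself.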
Thanks much,
Cameron.