[antlr-interest] Unicode input

Tue Feb 8 14:18:40 PST 2011

I'm having an issue parsing input that contains Unicode characters.

This is the code I'm using to test the parser (messageBytes is an array
created by reading bytes from a binary file):

private static void parseMessage(byte[] messageBytes) throws IOException {

    ByteArrayInputStream input = new ByteArrayInputStream(messageBytes);
    ANTLRInputStream in = new ANTLRInputStream(input);
    UnitedToteLexer lexer = new UnitedToteLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    UnitedToteParser parser = new UnitedToteParser(tokens);

    try {
        parser.message();

        printHexArray(messageBytes);

    } catch (Exception e) {
        // TODO handle unrecognized message formats
        System.out.println("Unrecognized message format");
    }
}
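
For reference, printHexArray is not shown above. A minimal sketch of what such
a helper could look like follows; the class name HexDumpSketch and the toHex
method are hypothetical and only illustrate the idea of dumping the raw bytes
so they can be compared against the characters the lexer complains about.

```java
public class HexDumpSketch {

    // Hypothetical stand-in for the printHexArray helper (not the original code).
    // Formats each byte as two uppercase hex digits separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes print as 80..FF instead of FFFFFF80.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    static void printHexArray(byte[] bytes) {
        System.out.println(toHex(bytes));
    }

    public static void main(String[] args) {
        // Made-up sample bytes, not taken from a real message.
        printHexArray(new byte[] { 0x02, (byte) 0x80, (byte) 0xFF, 0x78 });
    }
}
```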

The main problem I have at the moment is that I get a number of these guys:

line 1:1 no viable alternative at character ''
line 1:2 no viable alternative at character '�'
line 1:3 no viable alternative at character '�'
line 1:4 no viable alternative at character 'x'
line 1:5 no viable alternative at character '?'
...
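
One thing that may be worth checking here, as an assumption rather than a
confirmed diagnosis: the one-argument ANTLRInputStream constructor decodes the
bytes with the platform default charset, so the characters the lexer sees are
not necessarily the raw byte values. The sketch below uses only the JDK (no
ANTLR) and made-up sample bytes to show the difference between decoding as
ISO-8859-1, where every byte becomes exactly one char in 0x00..0xFF, and
decoding as UTF-8, where malformed sequences turn into U+FFFD (the '�' seen in
the output above). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ByteDecodingSketch {

    // ISO-8859-1 maps each byte to the char with the same value (0x00..0xFF).
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // UTF-8 decoding replaces malformed input with U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Renders each char as a four-digit hex code unit, e.g. "0002 0080".
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample bytes, not taken from a real message.
        byte[] sample = { 0x02, (byte) 0x80, (byte) 0xFF, 0x78 };

        System.out.println("ISO-8859-1: " + codeUnits(decodeLatin1(sample)));
        System.out.println("UTF-8:      " + codeUnits(decodeUtf8(sample)));
    }
}
```

If the input really is binary data, passing an explicit encoding to the
stream, for example new ANTLRInputStream(input, "ISO-8859-1"), would keep a
one-to-one mapping between bytes and characters. Whether that alone removes
the "no viable alternative" errors depends on the rest of the grammar.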

Essentially, one for each character that is not explicitly defined as a
token in my grammar. Nonetheless, I do have the following rule:

BYTE_VALUE    :    '\u0000'..'\uFFFE';

Which should, if I understand correctly, include all Unicode characters.

Now, I understand there was a charVocabulary option in previous versions of
ANTLR to aid with this problem, but it seems it was removed in ANTLR 3.

Was this problem solved in a different way?

[btw my grammar is rather large, I'm not sure I should post 400 lines in this
message.]