[antlr-interest] Unicode input

Wed Feb 9 06:34:30 PST 2011

I just realized, I'm also getting this error:

line 1:0 mismatched input '0' expecting BYTE_VALUE

Where the following rule exists within my grammar:

BYTE_VALUE    :    '\u0000'..'\uFFFE';

As I understand it, this should match any single Unicode character from
U+0000 through U+FFFE.

The question is: why does the character 0 (digit zero) not match the
BYTE_VALUE rule?

I have verified that the first character of the input is 0 ('\u0030').
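
For reference, a minimal sketch of how that check can be reproduced outside the parser. The class and method names are illustrative, and the one-byte array is a hypothetical stand-in for the real message; it only confirms that the digit zero falls inside the range used by BYTE_VALUE:

```java
public class ByteValueRangeCheck {
    // Mirrors the lexer range U+0000..U+FFFE from the BYTE_VALUE rule.
    static boolean inByteValueRange(char c) {
        return c >= '\u0000' && c <= '\uFFFE';
    }

    // Formats a char in backslash-u notation (four hex digits).
    static String describe(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the first byte of the real message.
        byte[] messageBytes = { 0x30 };
        char first = (char) (messageBytes[0] & 0xFF);
        System.out.println(describe(first) + " in range: " + inByteValueRange(first));
    }
}
```

Since the character is inside the range, the range itself is probably not what fails. A "mismatched input ... expecting" message is reported by the parser rather than the lexer, so one possibility (a guess, since the full grammar is not shown) is that the lexer matched '0' with some other rule or literal and emitted a token type other than BYTE_VALUE.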

Any clues?

On Tue, Feb 8, 2011 at 5:18 PM, Alex Lujan <alex at apption.com> wrote:

> I'm having an issue parsing input that contains Unicode characters.
>
> This is the code I'm using to test the parser (messageBytes is an array
> created by reading bytes from a binary file):
>
> private static void parseMessage(byte[] messageBytes) throws IOException {
>     ByteArrayInputStream input = new ByteArrayInputStream(messageBytes);
>     ANTLRInputStream in = new ANTLRInputStream(input);
>     UnitedToteLexer lexer = new UnitedToteLexer(in);
>     CommonTokenStream tokens = new CommonTokenStream(lexer);
>     UnitedToteParser parser = new UnitedToteParser(tokens);
>
>     try {
>         parser.message();
>         printHexArray(messageBytes);
>     } catch (Exception e) {
>         // TODO handle unrecognized message formats
>         System.out.println("Unrecognized message format");
>     }
> }
>
> The main problem I have at the moment is that I get a number of these guys:
>
> line 1:1 no viable alternative at character ' '
> line 1:2 no viable alternative at character '�'
> line 1:3 no viable alternative at character '�'
> line 1:4 no viable alternative at character 'x'
> line 1:5 no viable alternative at character '?'
> ...
>
> Essentially, one for each character that is not explicitly defined as a
> token in my grammar. Nonetheless, I do have the following rule:
>
> BYTE_VALUE    :    '\u0000'..'\uFFFE';
>
> Which should, if I understand correctly, include all Unicode characters.
>
> Now, I understand there was a charVocabulary option in previous versions of
> ANTLR to aid with this problem, but it seems it was removed in ANTLR 3.
>
> Was this problem solved in a different way?
>
> [btw my grammar is rather large, so I'm not sure I should post 400 lines
> in this message.]
>
>
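
Because messageBytes comes from a binary file, how the bytes are decoded into characters matters before the lexer ever sees them. The sketch below uses only the JDK (no ANTLR) to show the difference between decoding the same bytes as ISO-8859-1, which maps every byte to exactly one character, and as UTF-8, which replaces malformed sequences with U+FFFD. The class name, method names, and sample bytes are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class ByteDecodingSketch {
    // ISO-8859-1 maps each byte to exactly one char in U+0000..U+00FF.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // UTF-8 decoding substitutes U+FFFD for byte sequences that are not valid UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: '0', space, two bytes that are never valid in UTF-8, 'x'.
        byte[] messageBytes = { 0x30, 0x20, (byte) 0xFF, (byte) 0xFE, 0x78 };

        String latin1 = decodeLatin1(messageBytes);
        String utf8 = decodeUtf8(messageBytes);

        // Print the code point seen at index 2 under each decoding.
        System.out.println(Integer.toHexString(latin1.charAt(2)));
        System.out.println(Integer.toHexString(utf8.charAt(2)));
    }
}
```

If the intent is for each byte to reach the lexer as its own character, a one-byte-per-character encoding such as ISO-8859-1 may be the safer choice. As far as I know, the ANTLR 3 runtime's ANTLRInputStream has a constructor that takes an encoding name as a second argument, and without one the platform default charset is used; that is worth checking against the runtime version in use.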

-- 
Alejandro Lujan
Apption Software
(613) 725 62 68 x625