[antlr-interest] C target: lexer rule precedence confusion
Wincent Colaiuta
win at wincent.com
Mon Jun 4 08:38:17 PDT 2007
I'm trying to write a parser for wiki markup and have discovered some
puzzling behaviour in the C target. Here is a reduced test grammar:
grammar WikiText;
options {
language = C;
}
wikitext: .* EOF {printf("parser processed all tokens\n");};
STRONG: '\'\'\'' {printf("STRONG scanned: '\%s'\n", GETTEXT()->chars); };
DEFAULT : . { printf("DEFAULT scanned: '\%s'\n", GETTEXT()->chars); };
The DEFAULT rule is last because I want it to serve as a "catch all"
for any characters which don't get matched by any other rules. Given
the input '''foobar''' the lexer/parser print:
STRONG scanned: '''
./input(-4611699882581819391) : error 1 : Unexpected character at offset 0, near '''
DEFAULT scanned: '''''
DEFAULT scanned: 'f'
DEFAULT scanned: 'o'
DEFAULT scanned: 'o'
DEFAULT scanned: 'b'
DEFAULT scanned: 'a'
DEFAULT scanned: 'r'
STRONG scanned: '''
./input(-4611699882581819391) : error 1 : Unexpected character at offset 9, near '''
DEFAULT scanned: '''''
parser processed all tokens
I don't really understand the cause of those error messages, but much
more puzzling is the following: the STRONG token is recognized, but
then recognized all over again as a DEFAULT token. I added some
additional logging and saw that the ''' markers are indeed being sent
to the parser with the DEFAULT type (-5), the same as the other
letters.
Now the same grammar in Java:
grammar WikiText;
wikitext: .* EOF {System.out.println("parser processed all tokens");};
STRONG: '\'\'\'' {System.out.println("STRONG scanned"); };
DEFAULT: . {System.out.println("DEFAULT scanned"); };
When run under the ANTLRWorks debugger, this prints:
STRONG scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
STRONG scanned
parser processed all tokens
Note that in the Java case the ''' is recognized correctly as STRONG,
and the lexer then moves on. I didn't print the literal value of the
tokens because I don't know Java and couldn't find any examples of
how to do it; but you can see that six non-STRONG characters are
recognized.
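To make the expected behaviour concrete, here is a minimal sketch in plain C, independent of ANTLR, of the tokenization I assumed both targets would produce. It assumes rules are simply tried in grammar order (STRONG first, then the catch-all DEFAULT). The names next_token, count_tokens and the TOK_* constants are invented for this illustration and are not part of the ANTLR C runtime or of the generated lexer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token types for this sketch; these are not the
   values that ANTLR generates. */
enum token_type { TOK_STRONG, TOK_DEFAULT };

/* Scan one token starting at input[pos]. Stores the token type in
   *type and returns the number of characters consumed. Rules are
   tried in grammar order: STRONG first, then the DEFAULT catch-all. */
size_t next_token(const char *input, size_t pos, enum token_type *type)
{
    assert(input != NULL && type != NULL);
    /* strncmp stops at the terminating NUL, so this is safe even
       when fewer than three characters remain. */
    if (strncmp(input + pos, "'''", 3) == 0) {
        *type = TOK_STRONG;
        return 3;
    }
    *type = TOK_DEFAULT;
    return 1;
}

/* Tokenize the whole input and count the tokens of each type. */
void count_tokens(const char *input, size_t *strong, size_t *other)
{
    size_t pos = 0;
    size_t len;

    assert(input != NULL && strong != NULL && other != NULL);
    len = strlen(input);
    *strong = 0;
    *other = 0;

    while (pos < len) {
        enum token_type type;
        size_t consumed = next_token(input, pos, &type);
        if (type == TOK_STRONG)
            ++*strong;
        else
            ++*other;
        pos += consumed; /* each character is consumed exactly once */
    }
}
```

Calling count_tokens on the input '''foobar''' yields two STRONG tokens and six DEFAULT tokens, which matches the Java output above. In this sketch, no character is ever scanned twice.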
Can anyone explain this difference between the two language targets?
Or perhaps point out an elementary mistake I am making which is
causing this?
This is using 3.0 and a main.c file that is just like the ones in the
examples.
Cheers,
Wincent