[antlr-interest] C target: lexer rule precedence confusion

Wincent Colaiuta win at wincent.com
Mon Jun 4 08:38:17 PDT 2007


I'm trying to write a parser for wiki markup and have discovered some  
puzzling behaviour in the C target. Here is a reduced test grammar:

grammar WikiText;
options {
   language = C;	
}

wikitext: .* EOF {printf("parser processed all tokens\n");};
STRONG: '\'\'\'' {printf("STRONG scanned: '\%s'\n", GETTEXT()->chars); };
DEFAULT : . { printf("DEFAULT scanned: '\%s'\n", GETTEXT()->chars); };

The DEFAULT rule is last because I want it to serve as a "catch all"  
for any characters which don't get matched by any other rules. Given  
the input '''foobar''' the lexer/parser print:

STRONG scanned: '''
./input(-4611699882581819391) : error 1 : Unexpected character at offset 0, near '''
DEFAULT scanned: '''''
DEFAULT scanned: 'f'
DEFAULT scanned: 'o'
DEFAULT scanned: 'o'
DEFAULT scanned: 'b'
DEFAULT scanned: 'a'
DEFAULT scanned: 'r'
STRONG scanned: '''
./input(-4611699882581819391) : error 1 : Unexpected character at offset 9, near '''
DEFAULT scanned: '''''
parser processed all tokens

I don't really understand the cause of those error messages, but much  
more puzzling is the following: the STRONG token is recognized, but  
is then recognized all over again as a DEFAULT token. I added some  
additional logging and saw that the ''' markers are indeed being sent  
to the parser as the DEFAULT type (-5), the same as the other letters.

Now the same grammar in Java:

grammar WikiText;
wikitext: .* EOF {System.out.println("parser processed all tokens");};
STRONG: '\'\'\'' {System.out.println("STRONG scanned"); };
DEFAULT: . {System.out.println("DEFAULT scanned"); };

When run under the ANTLRWorks debugger, this prints out:

STRONG scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
DEFAULT scanned
STRONG scanned
parser processed all tokens

Note that in the Java case the ''' is recognized correctly as STRONG,  
and the lexer then moves on. I didn't print the literal value of the  
tokens because I don't know Java and couldn't find any examples of  
how to do it; but you can see that six non-STRONG characters are  
recognized.
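
To make the expected behaviour concrete, here is a hand-written sketch in  
plain C of the tokenization the grammar seems to ask for: try STRONG  
first, then fall back to DEFAULT for any single character. This is not  
ANTLR-generated code and does not use the ANTLR C runtime; the names  
token_type, next_token and count_tokens are made up for illustration,  
and it assumes a NUL-terminated input string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token types for this sketch; not ANTLR's generated values. */
enum token_type { TOK_EOF = 0, TOK_STRONG, TOK_DEFAULT };

struct token {
    enum token_type type;
    size_t start;   /* offset of the first character in the input */
    size_t length;  /* number of characters consumed */
};

/* Scan one token starting at offset pos (pos must not exceed strlen(input)).
   Rules are tried in grammar order: STRONG (three apostrophes) first,
   then DEFAULT (any single character) as the catch-all. */
static struct token next_token(const char *input, size_t pos)
{
    struct token t;

    assert(input != NULL);
    t.start = pos;
    if (input[pos] == '\0') {
        t.type = TOK_EOF;
        t.length = 0;
    } else if (strncmp(input + pos, "'''", 3) == 0) {
        t.type = TOK_STRONG;
        t.length = 3;
    } else {
        t.type = TOK_DEFAULT;
        t.length = 1;
    }
    return t;
}

/* Walk the whole input and count the tokens of one type. */
static size_t count_tokens(const char *input, enum token_type wanted)
{
    size_t pos = 0;
    size_t n = 0;

    for (;;) {
        struct token t = next_token(input, pos);
        if (t.type == TOK_EOF)
            break;
        if (t.type == wanted)
            n++;
        pos += t.length;   /* each token is consumed exactly once */
    }
    return n;
}
```

For the input '''foobar''' this sketch yields two STRONG tokens and six  
DEFAULT tokens, with no character scanned twice, which matches the Java  
output above rather than the C target output.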

Can anyone explain this difference between the two language targets?  
Or perhaps point out an elementary mistake I am making which is  
causing this?

This is using ANTLR 3.0 and a main.c file that is just like the ones  
in the examples.

Cheers,
Wincent


