[antlr-interest] C target: lexer rule precedence confusion

Mon Jun 4 13:41:51 PDT 2007

This may be related to another issue I am looking at, to do with the
backslash (\) processing when strings are passed to the C target. Give me
a day or so.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Wincent Colaiuta
> Sent: Monday, June 04, 2007 8:38 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] C target: lexer rule precedence confusion
> 
> I'm trying to write a parser for wiki markup and have discovered some
> puzzling behaviour in the C target. Here is a reduced test grammar:
> 
> grammar WikiText;
> options {
>    language = C;
> }
> 
> wikitext: .* EOF {printf("parser processed all tokens\n");};
> STRONG: '\'\'\'' {printf("STRONG scanned: '\%s'\n", GETTEXT()->chars); };
> DEFAULT : . { printf("DEFAULT scanned: '\%s'\n", GETTEXT()->chars); };
> 
> The DEFAULT rule is last because I want it to serve as a "catch all"
> for any characters which don't get matched by any other rules. Given
> the input '''foobar''' the lexer/parser print:
> 
> STRONG scanned: '''
> ./input(-4611699882581819391) : error 1 : Unexpected character at
> offset 0, near '''
> DEFAULT scanned: '''''
> DEFAULT scanned: 'f'
> DEFAULT scanned: 'o'
> DEFAULT scanned: 'o'
> DEFAULT scanned: 'b'
> DEFAULT scanned: 'a'
> DEFAULT scanned: 'r'
> STRONG scanned: '''
> ./input(-4611699882581819391) : error 1 : Unexpected character at
> offset 9, near '''
> DEFAULT scanned: '''''
> parser processed all tokens
> 
> I don't really understand the cause of those error messages, but much
> more puzzling is the following: note that the STRONG token is
> recognized, but then recognized all over again as a DEFAULT token. I
> added some additional logging, and saw that the ''' markers are
> indeed being sent to the parser as the DEFAULT type (-5), the same as the
> other letters.
> 
> Now the same grammar in Java:
> 
> grammar WikiText;
> wikitext: .* EOF {System.out.println("parser processed all tokens");};
> STRONG: '\'\'\'' {System.out.println("STRONG scanned"); };
> DEFAULT: . {System.out.println("DEFAULT scanned"); };
> 
> When run under the ANTLRWorks debugger, it prints:
> 
> STRONG scanned
> DEFAULT scanned
> DEFAULT scanned
> DEFAULT scanned
> DEFAULT scanned
> DEFAULT scanned
> DEFAULT scanned
> STRONG scanned
> parser processed all tokens
> 
> Note that in the Java case the ''' is recognized correctly as STRONG,
> and the lexer then moves on. I didn't print the literal value of the
> tokens because I don't know Java and couldn't find any examples of
> how to do it; but you can see that six non-STRONG characters are
> recognized.
> 
> Can anyone explain this difference between the two language targets?
> Or perhaps point out an elementary mistake I am making which is
> causing this?
> 
> This is using ANTLR 3.0 and a main.c file that is just like the ones in
> the examples.
> 
> Cheers,
> Wincent
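
For reference, the tokenization that the Java run above reports (STRONG, six
DEFAULT tokens, STRONG) can be modelled with a small hand-written scanner.
This is only a sketch of the expected result for this two-rule grammar, not
code that ANTLR generates; the class name WikiTextSketch, the scan() method
and the "TYPE [text]" output format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for the two lexer rules in the test grammar.
// Hypothetical illustration only; this is not ANTLR-generated code.
class WikiTextSketch {

    // Returns one "TYPE [text]" entry per token, in input order.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("'''", i)) {
                // STRONG: three consecutive single quotes
                tokens.add("STRONG [" + input.substring(i, i + 3) + "]");
                i += 3;
            } else {
                // DEFAULT: catch-all rule, exactly one character
                tokens.add("DEFAULT [" + input.charAt(i) + "]");
                i += 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same input as in the original report
        for (String token : scan("'''foobar'''")) {
            System.out.println(token);
        }
    }
}
```

Each ''' appears exactly once, as STRONG, and is never re-scanned as
DEFAULT, which is what the C target would be expected to do as well. In the
ANTLR 3 Java target, the matched text should be available inside a lexer
action via getText() (a method of org.antlr.runtime.Lexer), so the Java
grammar could print it much as the C version does with GETTEXT().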