[antlr-interest] Beginner lexing question.

Sun Aug 3 10:19:45 PDT 2008

I'm building a parser for a C-like language and I've encountered an 
issue that I think has something to do with the order in which ANTLR 
tries to match rules. The situation is this:

In my expression grammar I have a rule

unary_expression
    :   ... various irrelevant alternatives
    |   UNARY_OPERATOR cast_expression;

Near the bottom of the grammar file I have

UNARY_OPERATOR
    :   ('&' | '*' | '+' | '-' | '~' | '!');

Now when I try to parse '*X' I get a NoViableAltException. However, if I 
replace UNARY_OPERATOR in the unary_expression rule with an explicit 
'*', things work (well... not the other unary operators, of course). 
That is:

unary_expression
    :   ... various irrelevant alternatives
    |   '*' cast_expression;

I have explicit mention of '*' elsewhere in my grammar (in the rule for 
multiplicative expressions) so I thought that perhaps the lexer was 
seeing a '*' on the input and returning the token used in the multiply 
rule instead of a UNARY_OPERATOR token. Note that the multiply rule 
appears above the definition of UNARY_OPERATOR in my grammar file.
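
A cut-down grammar with the same shape might look like the sketch below. 
Only unary_expression and UNARY_OPERATOR come from the real grammar; the 
grammar name Repro, the ID rule, and the simplified 
multiplicative_expression and cast_expression rules are hypothetical 
stand-ins, and this exact file is untested.

```antlr
grammar Repro;   // hypothetical name, for illustration only

// Stand-in for the real multiplicative rule. The '*' literal here is the
// "explicit mention" that may be getting a token of its own in the lexer.
multiplicative_expression
    :   cast_expression ('*' cast_expression)*;

// Stand-in: the real cast_expression has more alternatives.
cast_expression
    :   unary_expression;

unary_expression
    :   ID                                  // stand-in for the other alternatives
    |   UNARY_OPERATOR cast_expression;

// Defined below the multiply rule, as in the real grammar file.
UNARY_OPERATOR
    :   ('&' | '*' | '+' | '-' | '~' | '!');

// Stand-in identifier rule so that an input such as *X can be lexed.
ID  :   ('a'..'z' | 'A'..'Z')+;
```

If the guess above is right, feeding *X to a parser generated from 
something like this should fail in the same way.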

However, if I change the definition of UNARY_OPERATOR to just

UNARY_OPERATOR
    :   '*';

It works! I'm at a loss to understand why including additional 
alternatives for UNARY_OPERATOR would cause a problem during the parse 
of '*X'. As a final test I put all the necessary alternatives directly 
in the unary_expression rule like this:

unary_expression
    :   ... various irrelevant alternatives
    |   ('&' | '*' | '+' | '-' | '~' | '!') cast_expression;

This works fine as well (now I get a warning about the UNARY_OPERATOR 
token definition being unreachable, but I understand that). In short, 
there is something about the way the lexer rules work that I'm not 
getting. I'm hoping someone here might be able to shed some light on 
this behavior.

Thanks in advance!

Peter