[antlr-interest] Non-deterministic behaviour in matching lexer tokens

Fri May 27 14:13:02 PDT 2011

Hello folks,

I am baffled about how to get my parser to handle lexer tokens with
overlapping definitions in the context of other rules.

My AST grammar is defined as:

-----------------------------------------
expression:
    (call)+
    ;

call:
    'call' IDENT
    ;

VALUE:
    (LETTER | DIGIT)+
    ;

IDENT:
    LETTER (LETTER | DIGIT | '_')*
    ;

fragment LETTER:
    ('a'..'z' | 'A'..'Z')
    ;

fragment DIGIT:
    '0'..'9'
    ;

WS:
    (' ' | '\t' | '\n' | '\r'| '\f')+
    {$channel = HIDDEN;}
    ;
-----------------------------------------

This grammar parses "call my_val": the input matches the call rule, and
'my_val' is lexed as IDENT (only IDENT allows the '_' character).

But if I try to parse "call myval", 'myval' is lexed as VALUE rather than
IDENT, and the parser throws the following error:

MismatchedTokenException: line 1:5 mismatched input 'myval' expecting '\u0004'

If I switch the order of VALUE and IDENT so that IDENT comes first, then
both of the following inputs match the call rule:

call myval
call my_val

In both cases the name is lexed as IDENT and things are fine.
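
The order dependence described above can be pictured with a small Python
model. This is only a sketch under an assumption: that the lexer takes the
longest match and, on a tie, prefers the rule listed first. ANTLR's real
prediction machinery is more involved, and the helper name tokenize and the
regex translations of the token rules are illustrative, not anything ANTLR
provides.

```python
import re

# Regex translations of the grammar's token rules (illustrative only).
CALL = ("'call'", r"call")
VALUE = ("VALUE", r"[A-Za-z0-9]+")
IDENT = ("IDENT", r"[A-Za-z][A-Za-z0-9_]*")
WS = ("WS", r"[ \t\n\r\f]+")


def tokenize(text, rules):
    """Assumed model: longest match wins; ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError("no rule matches at offset %d" % pos)
        if best_name != "WS":  # WS is sent to the hidden channel
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


value_first = [CALL, VALUE, IDENT, WS]  # order in the grammar above
ident_first = [CALL, IDENT, VALUE, WS]  # order after the swap

print(tokenize("call myval", value_first))
print(tokenize("call my_val", value_first))
print(tokenize("call myval", ident_first))
```

In this model 'myval' comes out as VALUE when VALUE is listed first and as
IDENT when IDENT is listed first, while 'my_val' is IDENT either way because
only IDENT can consume the '_'.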

My problem is that this solution breaks down once IDENT can no longer
subsume all input.

Consider a modified grammar where the VALUE token must end with a '!'
character and an "action" rule must match against the VALUE token:

--------------------------------------
expression:
    (call | action)+
    ;

call:
    'call' IDENT
    ;

action:
    'action' VALUE
    ;

IDENT:
    LETTER (LETTER | DIGIT | '_')*
    ;

VALUE:
    (LETTER | DIGIT) '!'+
    ;

fragment LETTER:
    ('a'..'z' | 'A'..'Z')
    ;

fragment DIGIT:
    '0'..'9'
    ;

WS:
    (' ' | '\t' | '\n' | '\r'| '\f')+
    {$channel = HIDDEN;}
    ;
-------------------------------------

If I try to parse the following input, the last line fails:

call MY_VAL      (parses)
call MYVAL       (parses)
action MYVAL!    (fails)

MismatchedTokenException: line 3:7 mismatched input 'MYVAL' expecting '\u0005'
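
The failing line can be traced with a rough Python model of the modified
grammar. It assumes a simplified lexer that takes the longest match and
breaks ties by rule order, which is not necessarily how ANTLR decides; the
helper name lex_prefix and the regex translations of the token rules are
illustrative.

```python
import re

# Regex translations of the modified grammar's token rules (illustrative).
RULES = [
    ("'call'", r"call"),
    ("'action'", r"action"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9_]*"),
    ("VALUE", r"[A-Za-z0-9]!+"),  # (LETTER | DIGIT) '!'+ as written
    ("WS", r"[ \t\n\r\f]+"),
]


def lex_prefix(text):
    """Return (tokens, unmatched_rest) under the assumed longest-match model."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            break  # nothing matches here; stop and report the remainder
        if best_name != "WS":  # WS is sent to the hidden channel
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens, text[pos:]


print(lex_prefix("action MYVAL!"))
print(lex_prefix("action X!"))
```

In this model 'MYVAL' is emitted as IDENT and the trailing '!' is left over,
which is consistent with the parser complaining about 'MYVAL' at column 7.
The second call shows that VALUE, as written, only covers a single letter or
digit followed by one or more '!' characters.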

Are there any options in ANTLR to help differentiate IDENT from VALUE?
Maybe the context of the rule can help in some way?

Thanks in advance,
Anthony Bargnesi