[antlr-interest] Problem with overlapping tokens

Wed Jul 18 08:51:47 PDT 2007

Hi everyone,

I am new to ANTLR 3 and have a grammar where some tokens "overlap"
partially.
An excerpt from my grammar probably explains it best; see the last two
(non-fragment) rules:

**************************************
fragment DIGIT : '0'..'9';

fragment LETTER    
    : 'a'..'z' | 'A'..'Z';

fragment LETTER_OR_DIGIT
    :     LETTER | DIGIT;

fragment URI_NON_RESERVED_SPECIAL_CHARS
    :     '-' | '.' | UNDERSCORE | '~';

fragment URI_RESERVED
    :     URI_MAIN_DELIMS |URI_SUB_DELIMS;

fragment URI_MAIN_DELIMS 
    :     ':' | '/' | '?' | '#' | '[' | ']' | '@';

fragment URI_SUB_DELIMS
    :    '!' | '$' | '&' | '\'' | '(' | ')' | '*' | '+' | ',' | ';' | '=';

fragment URI_PERCENT_ENCODED
    :    '\%' HEXDIGIT HEXDIGIT;

IDENTIFIER
    : (LETTER | '_') (LETTER_OR_DIGIT | '_')*;

URI_SCHEME
    :    LETTER (LETTER_OR_DIGIT | '+' | '-' | '.' | ':')* ;

**************************************
So URI_SCHEME largely subsumes IDENTIFIER (every identifier without an
underscore is also a valid URI_SCHEME). I understand that ANTLR will
work greedily, i.e., it will use the lexer rule that consumes the most
input chars, and if the length is identical it will use the rule that
comes first in the grammar. Is this correct?
So in the above case, whenever an identifier is followed by "+", "-"
etc., I will get a URI_SCHEME token containing these additional chars.
However, in some places I do not want to recognize a URI_SCHEME (e.g.
in expressions, when I add two IDENTIFIERs like this: "foo+bar"). There,
instead of the tokens IDENTIFIER, PLUS, IDENTIFIER, I get a single
URI_SCHEME token.
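
To make the behaviour I am describing concrete, here is a small Python
sketch. It is not ANTLR and not generated code; it only models the
longest-match rule with first-rule-wins ties that I assumed above, which
may not be what ANTLR 3 really does. The regexes are my own translation
of the two rules, and longest_match_tokens is a made-up helper name.

```python
import re

# Regex translations of the two lexer rules (an assumption, not ANTLR output).
RULES = [
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("URI_SCHEME", re.compile(r"[A-Za-z][A-Za-z0-9+\-.:]*")),
]


def longest_match_tokens(text):
    """Tokenize with longest match; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(longest_match_tokens("foo+bar"))
print(longest_match_tokens("foo"))
```

Under that model "foo+bar" comes out as one URI_SCHEME token, while a
plain "foo" is still an IDENTIFIER because it is listed first.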

The problem is not limited to these two rules, here are more:

**************************************
HOST    :    (LETTER_OR_DIGIT | URI_NON_RESERVED_SPECIAL_CHARS |
URI_PERCENT_ENCODED | URI_SUB_DELIMS)*;

USERINFO:    ( LETTER_OR_DIGIT | URI_NON_RESERVED_SPECIAL_CHARS |
URI_PERCENT_ENCODED | URI_SUB_DELIMS |':')*;

PATHCHARS:    ( LETTER_OR_DIGIT | URI_NON_RESERVED_SPECIAL_CHARS |
URI_PERCENT_ENCODED | URI_SUB_DELIMS |':' | '@')*;

FRAGMENT_OR_QUERY:        ( LETTER_OR_DIGIT |
URI_NON_RESERVED_SPECIAL_CHARS | URI_PERCENT_ENCODED | URI_SUB_DELIMS
|':' | '@' | '/' | '?')* ;

**************************************

Each rule overlaps with, or is a superset of, the previous one. The
"new" chars in the later rules are often delimiters in those places
where the other rules are used (e.g., the ':' delimits the URI_SCHEME,
but is consumed by FRAGMENT_OR_QUERY).
You might have guessed by now that I have a grammar containing
identifiers and URIs. If I use the lexer rules as specified, the
FRAGMENT_OR_QUERY rule will consume a URI like
http://somewhere.com/something completely, instead of giving me
URI_SCHEME, HOST, PATHCHARS.
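
The same effect can be shown with a rough Python model of these rules.
Again this is only a sketch under my longest-match assumption, not
ANTLR itself: the character classes are my regex translation of the
fragments, and char_rule and match_lengths are hypothetical helper
names.

```python
import re

# Character sets translated from the fragment rules (my assumption).
UNRESERVED = r"A-Za-z0-9\-._~"      # LETTER_OR_DIGIT + URI_NON_RESERVED_SPECIAL_CHARS
SUB_DELIMS = r"!$&'()*+,;="         # URI_SUB_DELIMS
PCT = r"%[0-9A-Fa-f]{2}"            # URI_PERCENT_ENCODED


def char_rule(extra=""):
    """Build a (...)* rule over the shared sets plus some extra chars."""
    return re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}{extra}]|{PCT})*")


RULES = {
    "URI_SCHEME": re.compile(r"[A-Za-z][A-Za-z0-9+\-.:]*"),
    "HOST": char_rule(),
    "USERINFO": char_rule(":"),
    "PATHCHARS": char_rule(":@"),
    "FRAGMENT_OR_QUERY": char_rule(":@/?"),
}


def match_lengths(text):
    """Return how many leading chars of text each rule would consume."""
    lengths = {}
    for name, pattern in RULES.items():
        m = pattern.match(text)
        lengths[name] = m.end() if m else 0
    return lengths


lengths = match_lengths("http://somewhere.com/something")
print(lengths)
print("longest:", max(lengths, key=lengths.get))
```

In this model FRAGMENT_OR_QUERY covers all 30 chars of the URI, while
URI_SCHEME stops after "http:", so a longest-match lexer never gets the
chance to emit the smaller tokens.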

I have tried a few things, but didn't find a clean way to resolve this
issue. Any suggestions? Should I move HOST, USERINFO etc. into the
parser rules? Would semantic/syntactic predicates be of help? Or am I
just missing something obvious?

Thanks in advance

Regards

JG