[antlr-interest] Why don't parsers support character ranges?

Tue Apr 22 19:16:16 PDT 2008

Hi all,

I would like to use character ranges in a parser as illustrated in the 
following example (a very reduced version of my real-world grammar):

grammar test1;
foo : before '@' after;
before : 'a'..'z';
after : 'm'..'z';

ANTLR generates a parser that ignores the range as if the grammar were

grammar test2;
foo : before '@' after;
before : ;
after : ;

IOW, the grammar fails to parse the input "a@m". If I break the grammar 
up into a lexer and a parser as in

grammar test3;
foo : BEFORE '@' AFTER;
BEFORE : 'a'..'z';
AFTER : 'm'..'z';

the generated code fails to parse "a@m" with a MismatchedTokenException 
at the 'm'. This is because ANTLR silently prioritizes BEFORE even 
though its set of characters intersects that of AFTER. Swapping BEFORE 
and AFTER would generate a parser that fails to recognize "m@m".
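
To make the prioritization concrete, here is a minimal Python sketch of a lexer that knows nothing about parser context and simply tries its rules in definition order. It is a simplified model of the behavior I observe, not ANTLR's actual generated code; the names RULES and tokenize are made up for illustration.

```python
# Simplified model of a context-free lexer: rules are tried in
# definition order and the first rule that matches wins.
# RULES and tokenize are hypothetical names, not ANTLR APIs.
RULES = [
    ("BEFORE", lambda c: "a" <= c <= "z"),
    ("AFTER", lambda c: "m" <= c <= "z"),
    ("AT", lambda c: c == "@"),
]


def tokenize(text):
    """Assign each character the type of the first rule that matches it."""
    tokens = []
    for ch in text:
        for name, matches in RULES:
            if matches(ch):
                tokens.append((name, ch))
                break
        else:
            raise ValueError(f"no rule matches {ch!r}")
    return tokens


# The 'm' is typed BEFORE, so a parser expecting AFTER sees a mismatch.
print(tokenize("a@m"))
# prints [('BEFORE', 'a'), ('AT', '@'), ('BEFORE', 'm')]
```

In this model AFTER can never be produced, because every character it accepts is already claimed by BEFORE.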

So here are my questions:

Why can't I use ranges in parsers?

Why doesn't ANTLR emit a warning when it ignores ranges in grammar rules?

How can I emulate the missing range feature without obfuscating my 
grammar too much? Semantic predicates?
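
To illustrate what I mean by emulating ranges with predicates, here is a rough Python sketch (not ANTLR syntax): the lexer emits a single generic LETTER token, and each parser rule checks the range itself, much as a validating semantic predicate would. All names (tokenize, expect_letter, parse_foo) are hypothetical, and I have not verified that ANTLR handles the equivalent grammar this way.

```python
# Hypothetical emulation of parser-level ranges: the lexer produces one
# generic LETTER token type and the parser rules validate the range.
def tokenize(text):
    """Lexer with non-overlapping token types: LETTER and AT."""
    tokens = []
    for ch in text:
        if "a" <= ch <= "z":
            tokens.append(("LETTER", ch))
        elif ch == "@":
            tokens.append(("AT", ch))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def expect_letter(token, low, high):
    """Stand-in for a validating predicate: LETTER within low..high."""
    kind, ch = token
    if kind != "LETTER" or not (low <= ch <= high):
        raise ValueError(f"expected {low!r}..{high!r}, got {ch!r}")
    return ch


def parse_foo(text):
    """foo : before '@' after, with the ranges checked in the parser."""
    tokens = tokenize(text)
    if len(tokens) != 3 or tokens[1][0] != "AT":
        raise ValueError("expected: before '@' after")
    before = expect_letter(tokens[0], "a", "z")
    after = expect_letter(tokens[2], "m", "z")
    return before, after
```

With this split, both parse_foo("a@m") and parse_foo("m@m") succeed, while parse_foo("a@b") is rejected in the parser rather than the lexer.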

Now let me put my tinfoil hat on and theorize a little bit: I think that 
the root cause of my confusion is ANTLR's distinction between lexer and 
parser. I think this distinction is purely historical and ANTLR might be 
better off without it. When writing grammars, I often find myself in 
situations where I know that certain lexer rules make sense in a certain 
parser context only but that context is not available to the lexer 
because the state that defines it is maintained in the parser.

I fondly remember my CS101 classes when we wrote recursive descent 
parsers for LL(*) in Opal (a functional language similar to Haskell). We 
didn't have to distinguish between lexer and parser and it felt very 
liberating. ;-)
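
For comparison, here is a small scannerless recursive descent sketch of the test1 grammar, written in Python rather than Opal. It is only meant to show the idea of parsing characters directly with no separate lexer; the helper names (char_range, literal, ParseError) are my own inventions.

```python
# Scannerless recursive descent sketch of:
#   foo : before '@' after;  before : 'a'..'z';  after : 'm'..'z';
# Each parser takes (text, pos) and returns (value, new_pos).
class ParseError(Exception):
    pass


def char_range(low, high):
    """Build a parser that accepts one character in low..high."""
    def parse(text, pos):
        if pos < len(text) and low <= text[pos] <= high:
            return text[pos], pos + 1
        raise ParseError(f"expected {low!r}..{high!r} at position {pos}")
    return parse


def literal(ch):
    """A single literal character is just a one-element range."""
    return char_range(ch, ch)


before = char_range("a", "z")
after = char_range("m", "z")
at = literal("@")


def foo(text):
    """Parse the whole input; each rule applies its own range in context."""
    b, pos = before(text, 0)
    _, pos = at(text, pos)
    a, pos = after(text, pos)
    if pos != len(text):
        raise ParseError(f"trailing input at position {pos}")
    return b, a
```

Because each rule reads characters itself, the overlap between 'a'..'z' and 'm'..'z' never becomes a conflict: foo("a@m") and foo("m@m") both parse, and foo("a@b") raises ParseError.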