[antlr-interest] Why don't parsers support character ranges?

Tue Apr 22 20:16:25 PDT 2008

I am quite new to parsing...

But I, too, don't yet understand why such a 'dumb' (context-unaware)
front-end lexer is seemingly standard practice in parsing, considering
the restrictions it seems to impose.

I assume there is a good reason for it, but this noobie can't see it.

In fact, this noobie tried to write (what I thought was) a fairly simple
'rewriting' parser, but ran into difficulties and fell back on exactly
that: a hand-coded recursive descent parser, which our resident Java guru
whipped up in an afternoon.
 - He DID use a standard lexer front-end design; HOWEVER, he quickly
found cases where he had to switch lexer/parser mid-stream to solve a
particular problem...

So, I'm sure it is my lack of knowledge in the area, but I hear where
you are coming from, Hannes!
I too fondly remember my liberation in a similar CS 101 class...

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Hannes Schmidt
Sent: Wednesday, 23 April 2008 12:16 PM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Why don't parsers support character ranges?

Hi all,

I would like to use character ranges in a parser as illustrated in the
following example (a very reduced version of my real-world grammar):

grammar test1;
foo : before '@' after;
before : 'a'..'z';
after : 'm'..'z';

ANTLR generates a parser that ignores the range as if the grammar were

grammar test2;
foo : before '@' after;
before : ;
after : ;

IOW, the grammar fails to parse the input "a@m". If I break the grammar
up into a lexer and a parser as in

grammar test3;
foo : BEFORE '@' AFTER;
BEFORE : 'a'..'z';
AFTER : 'm'..'z';

the generated code fails to parse "a@m" with a MismatchedTokenException
at the 'm'. This is because ANTLR silently prioritizes BEFORE even
though its set of characters intersects that of AFTER. Swapping BEFORE
and AFTER would generate a parser that fails to recognize "m@m".
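
The failure described above can be reproduced with a small stand-alone
model. This is only a sketch, not ANTLR's generated code: the class
FirstRuleWinsLexer and its methods are hypothetical names, and it assumes
that the lexer picks the first listed rule whose character set matches,
without any knowledge of what the parser expects next.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a context-unaware lexer; NOT ANTLR's generated code.
final class FirstRuleWinsLexer {
    // Token types, in the same order as the rules of grammar test3.
    enum Type { BEFORE, AFTER, AT }

    // Returns the first rule that matches c (assumed priority: rule order).
    static Type classify(char c) {
        if (c >= 'a' && c <= 'z') return Type.BEFORE; // listed first, so it wins
        if (c >= 'm' && c <= 'z') return Type.AFTER;  // never reached: BEFORE covers m..z
        if (c == '@') return Type.AT;
        throw new IllegalArgumentException("no rule matches '" + c + "'");
    }

    // Turns every input character into one token, with no parser context.
    static List<Type> tokenize(String input) {
        List<Type> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            tokens.add(classify(c));
        }
        return tokens;
    }

    // foo : BEFORE '@' AFTER ;
    static boolean parseFoo(String input) {
        List<Type> t = tokenize(input);
        return t.size() == 3
                && t.get(0) == Type.BEFORE
                && t.get(1) == Type.AT
                && t.get(2) == Type.AFTER;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a@m")); // [BEFORE, AT, BEFORE]
        System.out.println(parseFoo("a@m")); // false: the 'm' is never an AFTER
    }
}
```

In this model, testing the AFTER range first makes "a@m" parse but turns
the first 'm' of "m@m" into an AFTER token, so that input fails instead.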

So here are my questions:

Why can't I use ranges in parsers?

Why doesn't ANTLR emit a warning when it ignores ranges in grammar
rules?

How can I emulate the missing range feature without obfuscating my
grammar too much? Semantic predicates?

Now let me put my tinfoil hat on and theorize a little bit: I think that
the root cause of my confusion is ANTLR's distinction between lexer and
parser. I think this distinction is purely historical and ANTLR might be
better off without it. When writing grammars, I often find myself in
situations where I know that certain lexer rules make sense only in a
certain parser context, but that context is not available to the lexer
because the state that defines it is maintained in the parser.

I fondly remember my CS101 classes when we wrote recursive descent
parsers for LL(*) in Opal (a functional language similar to Haskell). We
didn't have to distinguish between lexer and parser and it felt very
liberating. ;-)
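
For comparison, here is a minimal sketch of what a hand-written,
scannerless recursive descent recognizer for grammar test1 might look
like. It is written in Java rather than Opal, the class and method names
are hypothetical, and it only accepts or rejects the input without
building a tree. Because each rule reads characters directly, the range
that applies is decided by the rule being parsed, not by a separate lexer.

```java
// Hypothetical scannerless recursive-descent recognizer for grammar test1.
final class ScannerlessFoo {
    private final String input;
    private int pos = 0;

    ScannerlessFoo(String input) {
        this.input = input;
    }

    // Consumes one character in [lo, hi]; on failure, consumes nothing.
    private boolean range(char lo, char hi) {
        if (pos < input.length()
                && input.charAt(pos) >= lo
                && input.charAt(pos) <= hi) {
            pos++;
            return true;
        }
        return false;
    }

    // before : 'a'..'z' ;
    private boolean before() {
        return range('a', 'z');
    }

    // after : 'm'..'z' ;
    private boolean after() {
        return range('m', 'z');
    }

    // foo : before '@' after ;  (the whole input must be consumed)
    boolean foo() {
        return before() && range('@', '@') && after() && pos == input.length();
    }

    static boolean parse(String s) {
        return new ScannerlessFoo(s).foo();
    }

    public static void main(String[] args) {
        System.out.println(parse("a@m")); // true
        System.out.println(parse("m@m")); // true
        System.out.println(parse("a@a")); // false: 'a' is outside 'm'..'z'
    }
}
```

Both "a@m" and "m@m" are accepted here, since the same character 'm' is
checked against 'a'..'z' in before() and against 'm'..'z' in after().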