[antlr-interest] Why don't parsers support character ranges?
Peter Nann
peter.nann at vecommerce.com.au
Tue Apr 22 20:16:25 PDT 2008
I am quite new to parsing...
But I too don't yet get the reasoning behind using such a 'dumb'
(context un-aware) front-end lexer as seemingly standard practice in
parsing, considering the restrictions that seems to impose.
I assume there is a good reason for it, but this noobie can't see it.
In fact, this noobie tried to write (what I thought was) a fairly simple
'rewriting' parser, but ran into difficulties, and fell back on exactly
that: a hand-coded recursive descent parser, which our resident Java guru
whipped up in an afternoon.
- He DID use a standard design lexer front-end, HOWEVER, he quickly
found cases where he had to switch lexer/parser mid-stream to solve a
particular problem...
So, I'm sure it is my lack of knowledge in the area, but I hear where
you are coming from, Hannes!
I too fondly remember my liberation in my similar CS 101 class...
-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Hannes Schmidt
Sent: Wednesday, 23 April 2008 12:16 PM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Why don't parsers support character ranges?
Hi all,
I would like to use character ranges in a parser as illustrated in the
following example (a very reduced version of my real-world grammar):
grammar test1;
foo : before '@' after;
before : 'a'..'z';
after : 'm'..'z';
ANTLR generates a parser that ignores the range as if the grammar were
grammar test2;
foo : before '@' after;
before : ;
after : ;
IOW, the grammar fails to parse the input "a@m". If I break the grammar
up into a lexer and a parser as in
grammar test3;
foo : BEFORE '@' AFTER;
BEFORE : 'a'..'z';
AFTER : 'm'..'z';
the generated code fails to parse "a@m" with a MismatchedTokenException
at the 'm'. This is because ANTLR silently prioritizes BEFORE even
though its set of characters intersects that of AFTER. Swapping BEFORE
and AFTER would generate a parser that fails to recognize "m@m".
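The behaviour described above can be reproduced with a small model. This is only a sketch under stated assumptions, not ANTLR's generated code: the names RULES, tokenize and parses_as_foo are hypothetical, and the model simply assumes that a context-unaware lexer tries its rules in declaration order and lets the first match win.

```python
# Hypothetical model of a context-unaware lexer (not ANTLR's actual code).
# Assumption: rules are tried in declaration order, first match wins.
RULES = [
    ("BEFORE", "a", "z"),
    ("AFTER", "m", "z"),
]

def tokenize(text, rules=RULES):
    """Turn each character into a token name without any parser context."""
    tokens = []
    for ch in text:
        if ch == "@":
            tokens.append("'@'")
            continue
        for name, lo, hi in rules:
            if lo <= ch <= hi:
                tokens.append(name)
                break
        else:
            raise ValueError(f"no lexer rule matches {ch!r}")
    return tokens

def parses_as_foo(tokens):
    """Check the token stream against  foo : BEFORE '@' AFTER;"""
    return tokens == ["BEFORE", "'@'", "AFTER"]

# With BEFORE declared first, the 'm' in "a@m" is already claimed by BEFORE.
print(tokenize("a@m"), parses_as_foo(tokenize("a@m")))

# With the rules swapped, "a@m" works but "m@m" breaks instead.
swapped = list(reversed(RULES))
print(tokenize("m@m", swapped), parses_as_foo(tokenize("m@m", swapped)))
```

In this model neither ordering accepts both inputs, because the decision about which token a character becomes is made before the parser has any say in it.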
So here are my questions:
Why can't I use ranges in parsers?
Why doesn't ANTLR emit a warning when it ignores ranges in grammar
rules?
How can I emulate the missing range feature without obfuscating my
grammar too much? Semantic predicates?
Now let me put my tinfoil hat on and theorize a little bit: I think that
the root cause of my confusion is ANTLR's distinction between lexer and
parser. I think this distinction is purely historical and ANTLR might be
better off without it. When writing grammars, I often find myself in
situations where I know that certain lexer rules make sense in a certain
parser context only but that context is not available to the lexer
because the state that defines it is maintained in the parser.
I fondly remember my CS101 classes when we wrote recursive descent
parsers for LL(*) in Opal (a functional language similar to Haskell). We
didn't have to distinguish between lexer and parser and it felt very
liberating. ;-)
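For illustration, here is roughly what such a lexer-less recursive descent recognizer could look like for the test1 grammar. It is a minimal sketch in Python rather than Opal, and the names parse_foo, char_in_range and literal are hypothetical. The point is only that each character range is checked at the position where the parser expects it, so overlapping ranges do not conflict.

```python
# Hypothetical scannerless recognizer for the grammar
#   foo    : before '@' after;
#   before : 'a'..'z';
#   after  : 'm'..'z';
# Each range is tested in parser context, so no separate lexer is needed.
def parse_foo(text):
    pos = 0

    def char_in_range(lo, hi):
        """Consume one character if it lies in lo..hi, else fail."""
        nonlocal pos
        if pos < len(text) and lo <= text[pos] <= hi:
            pos += 1
            return True
        return False

    def literal(ch):
        """Consume exactly the character ch."""
        return char_in_range(ch, ch)

    ok = char_in_range("a", "z") and literal("@") and char_in_range("m", "z")
    # Accept only if the whole input was consumed.
    return ok and pos == len(text)

print(parse_foo("a@m"))
print(parse_foo("m@m"))
print(parse_foo("a@a"))
```

The price of this approach is that whitespace, keywords and the like all have to be handled inside the parser rules themselves.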