[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 18:45:57 PDT 2008

From: Hannes Schmidt <antlr5 at hannesschmidt.net>
Subject: [antlr-interest] Why don't parsers support character ranges?
Date: Tue, 22 Apr 2008 19:16:16 -0700
Message-ID: <480E9BF0.7060006 at hannesschmidt.net>

> Hi all,
> 
> I would like to use character ranges in a parser as illustrated in the following example (a very reduced version of my real-world grammar):
> 
> grammar test1;
> foo : before '@' after;
> before : 'a'..'z';
> after : 'm'..'z';
> 
> ANTLR generates a parser that ignores the range as if the grammar were
> 
> grammar test2;
> foo : before '@' after;
> before : ;
> after : ;
> 
> IOW, the grammar fails to parse the input "a at m". If I break the grammar up into a lexer and a parser as in
> 
> grammar test3;
> foo : BEFORE '@' AFTER;
> BEFORE : 'a'..'z';
> AFTER : 'm'..'z';
> 
> the generated code fails to parse "a at m" with a MismatchedTokeException at the 'm'. This is because ANTLR silently prioritizes BEFORE even though its set of characters intersects that of AFTER. Swapping BEFORE and AFTER would generate a parser that fails to recognize "m at m".
> 
> So here are  my questions:
> 
> Why can't I use ranges in parsers?
> 
> Why doesn't ANTLR emit a warning when it ignores ranges in grammar rules?
> 
> How can I emulate the missing range feature without obfuscating my grammar too much? Semantic predicates?
> 
> Now let me put my tinfoil hat on and theorize a little bit: I think that the root cause of  my confusion is ANTLR's distinction between lexer and parser. I think this distinction is purely historical and ANTLR might be better of without it. When writing grammars, I often find myself in situations where I know that certain lexer rules make sense in a certain parser context only but that context is not available to the lexer because the state that defines it is maintained in the parser.

I have to disagree with you. As I have addressed this particular problem when
using PCCTS (don't ask and I won't tell why) I found that the best method to
handle these types of problems was to divide the lexer grammar into groups
(this isn't the correct term, but it is easy enought to dig it up).
By changing the active lexer group as I progress throught the grammar, I also
changes the set of token definitions which can be generated and any inclusion
problems as you indicate will be eliminated.

Why did this problem occur? Well, in the ideal case would each and every
decision point in the grammar has a unique subset of the lexer definition. That
would constitute a correct or at least near correct state compaction of a full
character-based DFA into a lexer/parser DFA pair. Infact, there would be many
very small lexer DFAs. This has traditionally not been done, but rather an
assumption have been made that a globally unique lexer definition can be
written and this is for many cases possible. As a feature, scoping have been
provided, to aid in solving the most accute cases.

What I found was that forming these subsets always resolved what I considered
"legal" constructs. So, backing out... what you could do is to actually form
all the subsets at each parser DFA decission point. You can now safely reduce
the subsets by combining subsets that never disagree on produced token for any
sequence of characters. From this basic rule, you can significantly reduce the
number of needed subsets. This reduction does infact not need to be based on
parser based subsets, but initial analysis can be done in the lexer analysis.

FOO: (a|b)b
BAR: bb
FOZ: bc

FOO and BAR does not mix well. For the string bb the lexer has no option but to
select either one of them if they are in the same lexer domain. FOZ can be in
either the same domain as FOO and BAR, infact it can be included in both
domains. A lexer driver approach would thus be that for each lexer rule to
include, it should be attempted to be included in all lexer domains, and if
not included into any of them, a new lexer domain is formed. Excess domains can
be tossed with parser knowledge when each decission point has selected amongst
the available domains.

One might however be fooled to beleive that all cases will be handled correctly
from a character-DFA world. This is not so. There might be cases where FOO and
BAR is both legal at the same decision point, but that only later information
in the grammar. This require lookahead processing, assuming that no character
difference prior to the conflict would separate the case, and this will always
require parser roll-back manuevers.

While some of the problems you address is valid, many cases can be resolved and
the can be resolved by means of automatic tools. The reduced complexity of each
DFA generation and ease of debugging is both arguments for why this division
is still a sound way of going about things.

So I end up thinking that your question is asked the wrong way around. Why is
it that the nifty tools does not solve this behind my back for the 95% of the
time that it is not too hard to spot it, if it was looking at it?
The exact methods discussed is most probably not ideal from any perspective,
but some good thinking about it should give some indications as to what type
of algorithms is bad and what is promessing.

> I fondly remember my CS101 classes when we wrote recursive descent parsers for LL(*) in Opal (a functional language similar to Haskell). We didn't have to distinguish between lexer and parser and it felt very liberating. ;-)
> 

Functional languages are liberating in mind, but hard to optimize. Just reading
a state of the art PhD on the topic can result in minor functional disorders.

Cheers,
Magnus