[antlr-interest] Why don't parsers support character ranges?

Hannes Schmidt antlr5 at hannesschmidt.net
Thu Apr 24 08:56:38 PDT 2008


Magnus Danielson wrote:
> From: Hannes Schmidt <antlr5 at hannesschmidt.net>
> Subject: [antlr-interest] Why don't parsers support character ranges?
> Date: Tue, 22 Apr 2008 19:16:16 -0700
> Message-ID: <480E9BF0.7060006 at hannesschmidt.net>
>
>   
>> Hi all,
>>
>> I would like to use character ranges in a parser as illustrated in the following example (a very reduced version of my real-world grammar):
>>
>> grammar test1;
>> foo : before '@' after;
>> before : 'a'..'z';
>> after : 'm'..'z';
>>
>> ANTLR generates a parser that ignores the range as if the grammar were
>>
>> grammar test2;
>> foo : before '@' after;
>> before : ;
>> after : ;
>>
>> IOW, the grammar fails to parse the input "a@m". If I break the grammar up into a lexer and a parser as in
>>
>> grammar test3;
>> foo : BEFORE '@' AFTER;
>> BEFORE : 'a'..'z';
>> AFTER : 'm'..'z';
>>
>> the generated code fails to parse "a@m" with a MismatchedTokenException at the 'm'. This is because ANTLR silently prioritizes BEFORE even though its set of characters intersects that of AFTER. Swapping BEFORE and AFTER would generate a parser that fails to recognize "m@m".
>>
>> So here are  my questions:
>>
>> Why can't I use ranges in parsers?
>>
>> Why doesn't ANTLR emit a warning when it ignores ranges in grammar rules?
>>
>> How can I emulate the missing range feature without obfuscating my grammar too much? Semantic predicates?
>>
>> Now let me put my tinfoil hat on and theorize a little bit: I think that the root cause of my confusion is ANTLR's distinction between lexer and parser. I think this distinction is purely historical and ANTLR might be better off without it. When writing grammars, I often find myself in situations where I know that certain lexer rules make sense only in a certain parser context, but that context is not available to the lexer because the state that defines it is maintained in the parser.
>>     
>
> I have to disagree with you. As I have addressed this particular problem when
> using PCCTS (don't ask and I won't tell why), I found that the best method to
> handle these types of problems was to divide the lexer grammar into groups
> (this isn't the correct term, but it is easy enough to dig it up).
> By changing the active lexer group as I progress through the grammar, I also
> change the set of token definitions which can be generated, and any inclusion
> problems such as you indicate will be eliminated.
>
>   
Agreed, this is a nifty solution and a viable alternative to what 
Johannes and Troy have suggested (manually making the tokens disjoint). 
It's just that this is too much overhead for 80% of all grammars and 80% 
of all ANTLR users. Eliminate the lexer concept and all of this goes 
away! For the performance-conscious there should be an option to front 
the parser with a token-generating lexer, but that should be optional, 
not the default. I'm guessing that the majority of users would rather 
start with an intuitive grammar and see if it needs optimizing. The 
forced lexer/parser separation is a typical case of premature 
optimization. Don't get me wrong, I love ANTLR; it is far better than 
anything I have used before, especially because of its power and 
flexibility. But that doesn't mean we can't improve it, right?
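
For the record, the disjoint-token workaround looks roughly like this (just a 
sketch along the lines of what Johannes and Troy suggested; the rule names are 
mine):

grammar test5;
foo    : before '@' after ;
before : LOW | HIGH ;   // any letter 'a'..'z'
after  : HIGH ;         // only 'm'..'z'
LOW    : 'a'..'l' ;
HIGH   : 'm'..'z' ;

It works, but the character-level split leaks into the token vocabulary, which 
is exactly the kind of noise I would like to avoid.
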
> Why did this problem occur? Well, in the ideal case each and every decision
> point in the grammar would have a unique subset of the lexer definition. That
> would constitute a correct, or at least near correct, state compaction of a
> full character-based DFA into a lexer/parser DFA pair. In fact, there would be
> many very small lexer DFAs. This has traditionally not been done; rather, an
> assumption has been made that a globally unique lexer definition can be
> written, and for many cases this is possible. As a feature, scoping has been
> provided to aid in solving the most acute cases.
>
> What I found was that forming these subsets always resolved what I considered
> "legal" constructs. So, backing out... what you could do is to actually form
> all the subsets at each parser DFA decision point. You can now safely reduce
> the subsets by combining subsets that never disagree on the produced token for
> any sequence of characters. From this basic rule, you can significantly reduce
> the number of needed subsets. This reduction does in fact not need to be based
> on parser-derived subsets; the initial analysis can be done in the lexer analysis.
>
> FOO: (a|b)b
> BAR: bb
> FOZ: bc
>
> FOO and BAR do not mix well. For the string bb the lexer has no option but to
> select one of them if they are in the same lexer domain. FOZ can be in the
> same domain as either FOO or BAR; in fact it can be included in both
> domains. A lexer-driven approach would thus be that each lexer rule to be
> included is first attempted in all existing lexer domains, and if it fits
> none of them, a new lexer domain is formed. Excess domains can be tossed with
> parser knowledge once each decision point has selected among the available
> domains.
>
> One might however be fooled into believing that all cases will be handled
> correctly from a character-DFA world. This is not so. There might be cases
> where FOO and BAR are both legal at the same decision point and only later
> information in the grammar separates them. This requires lookahead processing,
> assuming that no character difference prior to the conflict would separate the
> cases, and it will always require parser roll-back maneuvers.
>   
If there were no lexer, the parser would handle this with disambiguating 
predicates.
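
Something like this, for example (a rough sketch, untested; it keeps a single 
LETTER token and moves the 'm'..'z' restriction into a semantic predicate on 
the parser rule; strictly speaking it is a validating predicate here, since 
there is only one alternative):

grammar test4;
foo    : before '@' after ;
before : LETTER ;
// the predicate fails the parse unless the matched letter is >= 'm'
after  : t=LETTER { $t.getText().charAt(0) >= 'm' }? ;
LETTER : 'a'..'z' ;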

> While some of the problems you address are valid, many cases can be resolved,
> and they can be resolved by means of automatic tools. The reduced complexity
> of each DFA generation and the ease of debugging are both arguments for why
> this division is still a sound way of going about things.
>   
AFAIK, no tools exist for generating grammar specs that switch lexer 
DFAs or disambiguate tokens.
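
The closest I know of in ANTLR v3 is to fake the domains by hand with gated 
predicates in the lexer, roughly like this (a sketch only; "barDomain" is a 
made-up flag, and with a buffered token stream it would have to be flipped 
from lexer actions or a custom token source rather than from the parser):

@lexer::members { boolean barDomain = false; }

// gated predicates are consulted during lexer prediction, so only the
// rules of the currently active "domain" compete for the input
FOO : {!barDomain}?=> ('a'|'b') 'b' ;
BAR : { barDomain}?=> 'b' 'b' ;
FOZ : 'b' 'c' ;

Doable, but hardly something I would ask the 80% case to write by hand.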

> So I end up thinking that your question is asked the wrong way around. Why is
> it that the nifty tools do not solve this behind my back for the 95% of the
> time when it is not too hard to spot, if they were looking for it?
> The exact methods discussed are most probably not ideal from any perspective,
> but some good thinking about them should give some indication as to what types
> of algorithms are bad and which are promising.
>
>   
>> I fondly remember my CS101 classes in which we wrote recursive descent parsers for LL(*) grammars in Opal (a functional language similar to Haskell). We didn't have to distinguish between lexer and parser, and it felt very liberating. ;-)
>>
>>     
>
> Functional languages are liberating for the mind, but hard to optimize.
Not true. It's the lack of state that makes them difficult to use to 
solve real-world problems.
>  Just reading
> a state of the art PhD on the topic can result in minor functional disorders.
>   
Terence compares StringTemplate with a functional language and shows 
that in that similarity lies its strength. FLs have their uses, and 
understanding the paradigm can be enlightening.
> Cheers,
> Magnus
>   


