[antlr-interest] Why don't parsers support character ranges?

Fri May 2 20:44:48 PDT 2008

This is just to follow up on this thread in case someone using Google finds
it in the list archives.

The parser generator I ended up using is Rats! which was originally written
for the Extensible C project (XTC). It was extremely suitable for my needs
because it's written in Java and generates parsers in Java. More
importantly, it is also scanner-less, i.e., it doesn't need a lexer DFA that
groups characters into tokens. I had always suspected that the rule about
lexers being an unavoidable necessity is mere folklore. The existence of a
mature and efficient scanner-less parser generator confirms that suspicion.

My actual project is the implementation of an
RFC 2822-compliant email address parser. With Rats! I was able to basically
copy and paste the syntax specification from the RFC into the Rats! grammar.
Only minor cosmetic changes were necessary (changing production names to
camel case and substituting operator symbols). To me this indicates that
PEGs are both powerful and intuitive to use and that with Rats! there is a
superb implementation of PEGs.

The ANTLR grammar from my original post translates to the following Rats!
grammar:

module Test1;
option withParseTree;
public String Foo = Before '@' After EOF ;
String Before = [a-z] ;
String After = [m-z] ;
// EOF succeeds only when no input character is left.
transient void EOF = !_ ;
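
For readers who want to see what scanner-less matching amounts to, here is a
minimal hand-written sketch in Java. It is not the code Rats! generates; the
class name Test1Recognizer and its helper methods are made up for
illustration. Each character range is tested directly against the input at
the current position, so no token stream is ever built.

```java
// Hypothetical hand-written recognizer for the grammar above.
// It only illustrates the idea; it is not Rats! output.
final class Test1Recognizer {
    private final String input;
    private int pos = 0;

    private Test1Recognizer(String input) {
        this.input = input;
    }

    /** Foo = Before '@' After EOF */
    static boolean matches(String input) {
        Test1Recognizer r = new Test1Recognizer(input);
        return r.range('a', 'z')            // Before = [a-z]
            && r.literal('@')
            && r.range('m', 'z')            // After = [m-z]
            && r.pos == input.length();     // EOF: nothing left to read
    }

    /** Consumes one character if it lies in [lo, hi]; otherwise consumes nothing. */
    private boolean range(char lo, char hi) {
        if (pos < input.length()) {
            char c = input.charAt(pos);
            if (lo <= c && c <= hi) {
                pos++;
                return true;
            }
        }
        return false;
    }

    /** Consumes exactly the expected character. */
    private boolean literal(char expected) {
        return range(expected, expected);
    }
}
```

Because the position in the rule decides which range applies,
Test1Recognizer.matches("a@m") and Test1Recognizer.matches("m@m") both
succeed, while "a@a" is rejected.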

On 4/22/08 7:16 PM, "Hannes Schmidt" <antlr5 at hannesschmidt.net> wrote:

> Hi all,
> 
> I would like to use character ranges in a parser as illustrated in the
> following example (a very reduced version of my real-world grammar):
> 
> grammar test1;
> foo : before '@' after;
> before : 'a'..'z';
> after : 'm'..'z';
> 
> ANTLR generates a parser that ignores the range as if the grammar were
> 
> grammar test2;
> foo : before '@' after;
> before : ;
> after : ;
> 
> IOW, the grammar fails to parse the input "a@m". If I break the grammar
> up into a lexer and a parser as in
> 
> grammar test3;
> foo : BEFORE '@' AFTER;
> BEFORE : 'a'..'z';
> AFTER : 'm'..'z';
> 
> the generated code fails to parse "a@m" with a MismatchedTokenException
> at the 'm'. This is because ANTLR silently prioritizes BEFORE even
> though its set of characters intersects that of AFTER. Swapping BEFORE
> and AFTER would generate a parser that fails to recognize "m@m".
> 
> So here are my questions:
> 
> Why can't I use ranges in parsers?
> 
> Why doesn't ANTLR emit a warning when it ignores ranges in grammar rules?
> 
> How can I emulate the missing range feature without obfuscating my
> grammar too much? Semantic predicates?
> 
> Now let me put my tinfoil hat on and theorize a little bit: I think that
> the root cause of my confusion is ANTLR's distinction between lexer and
> parser. I think this distinction is purely historical and ANTLR might be
> better off without it. When writing grammars, I often find myself in
> situations where I know that certain lexer rules make sense in a certain
> parser context only but that context is not available to the lexer
> because the state that defines it is maintained in the parser.
> 
> I fondly remember my CS101 classes when we wrote recursive descent
> parsers for LL(*) in Opal (a functional language similar to Haskell). We
> didn't have to distinguish between lexer and parser and it felt very
> liberating. ;-)
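
The token-priority problem described in the quoted message (test3) can be
reproduced with a toy lexer. The sketch below is an assumption-laden model,
not ANTLR's generated code: it simply tries the rules in declaration order,
and the names FirstMatchLexerDemo, classify, tokenize and parseFoo are
invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a context-free lexer in front of a parser.
// Hypothetical code; it only mimics "first matching rule wins".
final class FirstMatchLexerDemo {
    enum TokenType { BEFORE, AFTER, AT }

    /** Classifies one character without any knowledge of the parser's state. */
    static TokenType classify(char c) {
        if ('a' <= c && c <= 'z') return TokenType.BEFORE; // BEFORE : 'a'..'z'
        if ('m' <= c && c <= 'z') return TokenType.AFTER;  // AFTER : 'm'..'z', never reached
        if (c == '@') return TokenType.AT;
        throw new IllegalArgumentException("unexpected character: " + c);
    }

    /** Turns the whole input into tokens before any parsing happens. */
    static List<TokenType> tokenize(String input) {
        List<TokenType> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            tokens.add(classify(c));
        }
        return tokens;
    }

    /** foo : BEFORE '@' AFTER */
    static boolean parseFoo(String input) {
        return tokenize(input).equals(
            Arrays.asList(TokenType.BEFORE, TokenType.AT, TokenType.AFTER));
    }
}
```

Here tokenize("a@m") yields BEFORE, AT, BEFORE, so parseFoo("a@m") fails at
the third token: the lexer has already committed to BEFORE for the 'm' and
the parser never gets to say that it expected AFTER.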