[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 20:17:00 PDT 2008

Maybe you are trying to do too much in the parser and lexer. Don't be afraid to introduce what looks like overhead by adding an extra step of semantic verification.

In that case you would not distinguish BEFORE and AFTER, or any other overlapping sequences that aren't 'obvious', in the lexer. Instead, you walk a syntactically sound tree that just has, say, a token called TEXT, and report much better errors such as "The 'before' part of the expression xxxxxx cannot contain characters from the set ..." and so on.

This kind of error is usually much more useful to the programmer/author, as it has better context than "Syntax error at 'abcde'", which doesn't really tell you much. Sure, you might be able to see that you got an AFTER token but wanted a BEFORE token, but then you would be building semantics into your error message handler. Walking a tree gives you a structure where such things are much easier to pin down.
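
As a rough sketch of that idea (in Python rather than an ANTLR tree grammar, and with hypothetical names such as `ALLOWED` and `check_expression`), the check below assumes the parser has already accepted something of the form TEXT '@' TEXT, and only then verifies each part against its allowed character set. The character sets are an assumption taken from the BEFORE/AFTER example quoted in this thread.

```python
import string

# Allowed character sets, taken from the BEFORE/AFTER example in this thread.
# (Assumption: 'before' may use a..z, 'after' only m..z.)
ALLOWED = {
    "before": set(string.ascii_lowercase),
    "after": set("mnopqrstuvwxyz"),
}


def check_expression(text):
    """Return a list of semantic error messages for a 'before@after' string.

    The split on '@' stands in for the parser, which would already have
    produced a tree with two generic TEXT tokens.
    """
    before, sep, after = text.partition("@")
    if not sep:
        return ["Syntax error: expected '@' in '%s'" % text]

    errors = []
    for part_name, part in (("before", before), ("after", after)):
        # Characters that are present but not permitted in this part.
        bad = sorted(set(part) - ALLOWED[part_name])
        if bad:
            errors.append(
                "The '%s' part of the expression %s cannot contain "
                "characters from the set %s" % (part_name, text, bad)
            )
    return errors
```

Calling `check_expression("abc@abc")` yields one message about the 'after' part, while `check_expression("abc@xyz")` yields none. The point is only that the message can name the part and the offending characters, which a lexer-level failure cannot.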

Perhaps an easier example to see is keywords such as 'public', 'private' etc. on, say, a property in a class of some language. You could try to construct a parser that threw syntax errors if the word public was used on something that can never be public, but the parser is much nicer if you just allow all the modifiers everywhere and then, in your tree parser, trigger a semantic error that says "properties cannot be declared virtual" or whatever. Again, you might think you could work out such things in the parser, but you will either get "Syntax error at 'virtual'", which is no use to man nor beast, or you will end up tracking virtual, public and so on and trying to give semantic errors while parsing.
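
A minimal sketch of such a tree-walking check, again in Python and under stated assumptions: the `(kind, name, modifiers)` tuples stand in for tree nodes, and the rule table `ALLOWED_MODIFIERS` is invented for illustration and does not describe any particular language.

```python
# Hypothetical rule table: which modifiers each kind of declaration may carry.
ALLOWED_MODIFIERS = {
    "method": {"public", "private", "virtual"},
    "property": {"public", "private"},
}

# Plural forms used when building the error messages.
PLURAL = {"method": "methods", "property": "properties"}


def check_declarations(declarations):
    """Walk (kind, name, modifiers) nodes and collect semantic errors.

    The parser is assumed to have accepted any modifier on any declaration;
    this pass decides which combinations are actually legal.
    """
    errors = []
    for kind, name, modifiers in declarations:
        for modifier in modifiers:
            if modifier not in ALLOWED_MODIFIERS[kind]:
                errors.append(
                    "%s cannot be declared %s (in '%s')"
                    % (PLURAL[kind], modifier, name)
                )
    return errors
```

For example, `check_declarations([("property", "Size", ["public", "virtual"])])` reports that properties cannot be declared virtual and names the declaration involved, whereas the same modifiers on a method pass without complaint.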

Without knowing what you are trying to parse, I would still suggest that you simplify the tokens you are trying to lex. That will simplify the parser, cut down on syntax errors that are often difficult to interpret, and give you much better context for the errors you do report. Personally, I think that the C# analyzer/parser in Visual Studio 2005+ is an excellent example of what to strive for - there are almost no 'syntax' errors, which means the errors it gives you are really rather useful.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Hannes Schmidt
> Sent: Wednesday, April 23, 2008 6:43 PM
> To: Daniels, Troy (US SSA); Johannes Luber
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Why don't parsers support character
> ranges?
> 
> Johannes and Troy,
> 
> Thanks, guys. I was afraid that this was my only option. My real-world
> grammar has dozens of tokens with non-disjunctive sets of characters. I
> guess, I'll have to play trained monkey ...
> 
> On 4/23/08 3:39 PM, "Daniels, Troy (US SSA)"
> <troy.daniels at baesystems.com>
> wrote:
> 
> >
> >
> >>
> >>
> >> You could alternatively use:
> >>
> >> grammar test4;
> >> foo : BEFORE '@' AFTER;
> >> BEFORE : A_TO_L | M_TO_Z;
> >> AFTER : M_TO_Z;
> >> fragment A_TO_L: 'a'..'l';
> >> fragment M_TO_Z: 'm'..'z';
> >>
> >
> > Actually, you can't.  Nothing will ever match AFTER, since BEFORE will
> > consume it.  If you make BEFORE and AFTER parser rules, that would work.
> >
> > grammar test6;
> >  foo : before '@' after;
> >  before : A_TO_L | M_TO_Z;
> >  after : M_TO_Z;
> >
> > A_TO_L: 'a'..'l';
> > M_TO_Z: 'm'..'z';
> >
> >
> > Troy
> >
> >
> >> But I suppose it is easier for error messages, if you leave
> >> A_TO_L in for AFTER and check it in a later stage for correctness.
> >>
> >> grammar test5;
> >> foo : ALPHA '@' ALPHA;
> >> ALPHA: 'a'..'z';
> >>
> >> Johannes
> >>
>