[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 19:57:06 PDT 2008

Like any other technical toy, it will give more 'hours of fun' if you work at it a little, but here are some things to avoid until you know what they are all 'mean':

1) Don't use 'ffff' in a parser rule, construct 'real' lexer rules;
2) Don't use the backtracking=true option, at least the global one, until you understand where it might be a good idea;
3) Remember to construct a lexer that tokenizes everything without knowing anything about context, it should [be able to] run through all the input and give out a set of tokens, then the parser will run against them. This is a common problem when people start out and start putting 'abc' in their parsers and is a lot easier to see by not using that in the parser until you have your head around what happens.
4) Don't invent languages that are indent sensitive and don't use terminals to surround blocks [that's isn;t a real rule ;-)]

On performance, the basic thing is that it ALWAYS matters. Not worrying about it as how things that were only going to be running once a week get run once every millisecond and kill systems. However, I think that you are underestimating the impact that Loring is referring to as while you might be able to have a lot of switch statements, you will need a really huge pile of them. There are some uses for non-lexing parsers, if you know where to use them, but they are not generally needed.

I think that if you persevere, perhaps starting with some of the examples to see if you can work out why they are constructed the way they are, then in a few weeks you will be very happy with it all; it is mostly a matter of gestalt.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Peter Nann
> Sent: Wednesday, April 23, 2008 5:01 PM
> To: Loring Craymer; Hannes Schmidt; antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Why don't parsers support character
> ranges?
> 
> 
> Hmmm, I was hoping for more than the 'efficiency' argument...
> I am wondering if that argument is about 10 years past its use-by
> date...
> We are not in the days of single-digit-Megahertz and RAM measured in k
> anymore... when lexx and yacc were written...
> 
> It would depend on the scale of parsing you need to do of course, but
> for small-scale parsing I would question whether CPU and RAM matters
> any
> more on that task...
> 
> I will have to take your word about 'combinatorial explosion' for some
> problems, but I thought simple RDP's could pretty much break down to
> one
> branch (as in: switch statement) per character (or token if you
> tokenize
> it), which doesn't seem excessive, or combinatorial.
>  - But, yes, that was just my CS101 project!
> 
> 
> And using an analogy of how a human does it, versus how a computer
> should do it, is not a valid technical argument at all in my book.
> 
> 
> Sorry to be a sour-puss, but I was quite excited about ANTLR at first
> look, but then got disappointed very quickly, so I'm a bit like a child
> who just broke his favourite toy...  ;-)
> 
> 
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Loring Craymer
> Sent: Thursday, 24 April 2008 1:09 AM
> To: Hannes Schmidt; antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Why don't parsers support character
> ranges?
> 
> I think that Gavin answered the range question pretty well.  As to the
> separation of lexer and parser, that is not an historical artifact.
> Had
> you tried to use your CS 101 LL1 parser for production use, you would
> have been very disappointed with both its performance and memory
> footprint.  Lexers work with characters; parsers work with tokens.  If
> your characters are packaged as tokens, memory usage expands
> dramatically and it takes longer to dereference the token to look at
> the
> character.  That can easily cost you a factor of 10-100 in performance;
> ANTLR 2 just packaged characters into strings, and that cost a factor
> of
> 5 over more optimal approaches (ANTLR 3 does much better).  The impact
> of using a single stage of analysis on the parser is equally dramatic:
> "foo is blue" goes from a 3 token analysis problem ("foo" "is" "blue")
> to an 11 token analysis problem (can you say 'combinatorial explosion')
> with a corresponding dramatic decrease in performance.
> 
> As to the  "before '@' after" example you give (assuming that you
> really
> do need to distinguish BEFORE and AFTER tokens), the best approach
> might
> be to support multiple token emission from lexer rules so  that you do
> something like
> 
> AT :  'a' .. 'l' { emit(BEFORE); } '@' { emit(AT); } 'm'..'z'
> {emit(AFTER); } ;
> 
> foo : BEFORE AT AFTER
> 
> but ANTLR 3 does not quite do that yet.
> 
> Note that the natural approach to reading text is in fact to lex
> characters into tokens which we call "words" (reading text without
> spaces requires some effort) and then to parse sentences made of words.
> Spaces and punctuation make a big difference because they support
> hierarchical processing.
> 
> --Loring
> 
> 
> ----- Original Message ----
> > From: Hannes Schmidt <antlr5 at hannesschmidt.net>
> > To: antlr-interest at antlr.org
> > Sent: Tuesday, April 22, 2008 7:16:16 PM
> > Subject: [antlr-interest] Why don't parsers support character ranges?
> >
> > Hi all,
> >
> > I would like to use character ranges in a parser as illustrated in
> the
> 
> > following example (a very reduced version of my real-world grammar):
> >
> > grammar test1;
> > foo : before '@' after;
> > before : 'a'..'z';
> > after : 'm'..'z';
> >
> > ANTLR generates a parser that ignores the range as if the grammar
> were
> >
> > grammar test2;
> > foo : before '@' after;
> > before : ;
> > after : ;
> >
> > IOW, the grammar fails to parse the input "a at m". If I break the
> > grammar up into a lexer and a parser as in
> >
> > grammar test3;
> > foo : BEFORE '@' AFTER;
> > BEFORE : 'a'..'z';
> > AFTER : 'm'..'z';
> >
> > the generated code fails to parse "a at m" with a
> MismatchedTokeException
> 
> > at the 'm'. This is because ANTLR silently prioritizes BEFORE even
> > though its set of characters intersects that of AFTER. Swapping
> BEFORE
> 
> > and AFTER would generate a parser that fails to recognize "m at m".
> >
> > So here are  my questions:
> >
> > Why can't I use ranges in parsers?
> >
> > Why doesn't ANTLR emit a warning when it ignores ranges in grammar
> rules?
> >
> > How can I emulate the missing range feature without obfuscating my
> > grammar too much? Semantic predicates?
> >
> > Now let me put my tinfoil hat on and theorize a little bit: I think
> > that the root cause of  my confusion is ANTLR's distinction between
> > lexer and parser. I think this distinction is purely historical and
> > ANTLR might be better of without it. When writing grammars, I often
> > find myself in situations where I know that certain lexer rules make
> > sense in a certain parser context only but that context is not
> > available to the lexer because the state that defines it is
> maintained
> in the parser.
> >
> > I fondly remember my CS101 classes when we wrote recursive descent
> > parsers for LL(*) in Opal (a functional language similar to Haskell).
> > We didn't have to distinguish between lexer and parser and it felt
> > very liberating. ;-)
> 
> 
> 
> 
> _______________________________________________________________________
> _
> ____________
> Be a better friend, newshound, and
> know-it-all with Yahoo! Mobile.  Try it now.
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ