[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 17:01:13 PDT 2008

Hmmm, I was hoping for more than the 'efficiency' argument...
I am wondering if that argument is about 10 years past its use-by
date...
We are not in the days of single-digit-Megahertz and RAM measured in k
anymore... when lexx and yacc were written...

It would depend on the scale of parsing you need to do of course, but
for small-scale parsing I would question whether CPU and RAM matters any
more on that task...

I will have to take your word about 'combinatorial explosion' for some
problems, but I thought simple RDP's could pretty much break down to one
branch (as in: switch statement) per character (or token if you tokenize
it), which doesn't seem excessive, or combinatorial.
 - But, yes, that was just my CS101 project!

And using an analogy of how a human does it, versus how a computer
should do it, is not a valid technical argument at all in my book.

Sorry to be a sour-puss, but I was quite excited about ANTLR at first
look, but then got disappointed very quickly, so I'm a bit like a child
who just broke his favourite toy...  ;-)

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Loring Craymer
Sent: Thursday, 24 April 2008 1:09 AM
To: Hannes Schmidt; antlr-interest at antlr.org
Subject: Re: [antlr-interest] Why don't parsers support character
ranges?

I think that Gavin answered the range question pretty well.  As to the
separation of lexer and parser, that is not an historical artifact.  Had
you tried to use your CS 101 LL1 parser for production use, you would
have been very disappointed with both its performance and memory
footprint.  Lexers work with characters; parsers work with tokens.  If
your characters are packaged as tokens, memory usage expands
dramatically and it takes longer to dereference the token to look at the
character.  That can easily cost you a factor of 10-100 in performance;
ANTLR 2 just packaged characters into strings, and that cost a factor of
5 over more optimal approaches (ANTLR 3 does much better).  The impact
of using a single stage of analysis on the parser is equally dramatic:
"foo is blue" goes from a 3 token analysis problem ("foo" "is" "blue")
to an 11 token analysis problem (can you say 'combinatorial explosion')
with a corresponding dramatic decrease in performance.

As to the  "before '@' after" example you give (assuming that you really
do need to distinguish BEFORE and AFTER tokens), the best approach might
be to support multiple token emission from lexer rules so  that you do
something like

AT :  'a' .. 'l' { emit(BEFORE); } '@' { emit(AT); } 'm'..'z'
{emit(AFTER); } ;

foo : BEFORE AT AFTER

but ANTLR 3 does not quite do that yet.

Note that the natural approach to reading text is in fact to lex
characters into tokens which we call "words" (reading text without
spaces requires some effort) and then to parse sentences made of words.
Spaces and punctuation make a big difference because they support
hierarchical processing.

--Loring

----- Original Message ----
> From: Hannes Schmidt <antlr5 at hannesschmidt.net>
> To: antlr-interest at antlr.org
> Sent: Tuesday, April 22, 2008 7:16:16 PM
> Subject: [antlr-interest] Why don't parsers support character ranges?
> 
> Hi all,
> 
> I would like to use character ranges in a parser as illustrated in the

> following example (a very reduced version of my real-world grammar):
> 
> grammar test1;
> foo : before '@' after;
> before : 'a'..'z';
> after : 'm'..'z';
> 
> ANTLR generates a parser that ignores the range as if the grammar were
> 
> grammar test2;
> foo : before '@' after;
> before : ;
> after : ;
> 
> IOW, the grammar fails to parse the input "a at m". If I break the 
> grammar up into a lexer and a parser as in
> 
> grammar test3;
> foo : BEFORE '@' AFTER;
> BEFORE : 'a'..'z';
> AFTER : 'm'..'z';
> 
> the generated code fails to parse "a at m" with a MismatchedTokeException

> at the 'm'. This is because ANTLR silently prioritizes BEFORE even 
> though its set of characters intersects that of AFTER. Swapping BEFORE

> and AFTER would generate a parser that fails to recognize "m at m".
> 
> So here are  my questions:
> 
> Why can't I use ranges in parsers?
> 
> Why doesn't ANTLR emit a warning when it ignores ranges in grammar
rules?
> 
> How can I emulate the missing range feature without obfuscating my 
> grammar too much? Semantic predicates?
> 
> Now let me put my tinfoil hat on and theorize a little bit: I think 
> that the root cause of  my confusion is ANTLR's distinction between 
> lexer and parser. I think this distinction is purely historical and 
> ANTLR might be better of without it. When writing grammars, I often 
> find myself in situations where I know that certain lexer rules make 
> sense in a certain parser context only but that context is not 
> available to the lexer because the state that defines it is maintained
in the parser.
> 
> I fondly remember my CS101 classes when we wrote recursive descent 
> parsers for LL(*) in Opal (a functional language similar to Haskell). 
> We didn't have to distinguish between lexer and parser and it felt 
> very liberating. ;-)

________________________________________________________________________
____________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile.  Try it now.
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ