[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 18:35:56 PDT 2008

Uh, oh, what have I started? I shouldn't have theorized ...

On 4/23/08 8:08 AM, "Loring Craymer" <lgcraymer at yahoo.com> wrote:

> I think that Gavin answered the range question pretty well.  As to the
> separation of lexer and parser, that is not an historical artifact.  Had you
> tried to use your CS 101 LL1 parser for production use, you would have been
> very disappointed with both its performance and memory footprint.

On the contrary, I was pleasantly surprised at how well they performed. I
guess it was due to the fact that even back then Opal did a good job
detecting and eliminating tail recursion. BTW, our parsers weren't LL(1)
because they did backtracking, more like LL(*).

> Lexers work 
> with characters; parsers work with tokens.

Says who? Just because it's been like this for 40 years doesn't mean it can
never change.

>  If your characters are packaged as
> tokens, memory usage expands dramatically and it takes longer to dereference
> the token to look at the character.  That can easily cost you a factor of
> 10-100 in performance; ANTLR 2 just packaged characters into strings, and that
> cost a factor of 5 over more optimal approaches (ANTLR 3 does much better).

In most cases I don't really care whether my recognizer finishes in 50 or
250 milliseconds, especially if I need to spend ten additional hours
understanding, debugging and maintaining a grammar that's more complex than
it needs to be. 

Don't get me wrong, for some languages the lexer/parser separation makes
sense and yields concise, straight-forward grammars. Unfortunately, those
are just some languages, most likely the ones that were designed with that
separation in mind. In other cases (like mine) that separation yields
awkward grammars, riddled with predicates and parser logic duplicated in the
lexer. The lexer/parser separation is a huge sacrifice to performance and I
am simply not convinced that it's necessary anymore.

> The impact of using a single stage of analysis on the parser is equally
> dramatic:  "foo is blue" goes from a 3 token analysis problem ("foo" "is"
> "blue") to an 11 token analysis problem (can you say 'combinatorial
> explosion') with a corresponding dramatic decrease in performance.

Assuming that the average length of a token is constant, there is no
'combinatorial explosion' as you put it simply because c * O(f) = O(f).
Whether a mere linear decline can be called dramatic is entirely subjective.

> As to the  "before '@' after" example you give (assuming that you really do
> need to distinguish BEFORE and AFTER tokens), the best approach might be to
> support multiple token emission from lexer rules so  that you do something
> like
> 
> AT :  'a' .. 'l' { emit(BEFORE); } '@' { emit(AT); } 'm'..'z' {emit(AFTER); }
> ;
> 
> foo : BEFORE AT AFTER
> 
> but ANTLR 3 does not quite do that yet.
>
> Note that the natural approach to reading text is in fact to lex characters
> into tokens which we call "words" (reading text without spaces requires some
> effort) and then to parse sentences made of words.  Spaces and punctuation
> make a big difference because they support hierarchical processing.

That's a false analogy. What's natural to a human might not be the best
strategy for a machine. For some languages, having to split the recognizer
into a DFA and a TM *does* feel unnatural to me.

> 
> --Loring
> 
> 
> ----- Original Message ----
>> From: Hannes Schmidt <antlr5 at hannesschmidt.net>
>> To: antlr-interest at antlr.org
>> Sent: Tuesday, April 22, 2008 7:16:16 PM
>> Subject: [antlr-interest] Why don't parsers support character ranges?
>> 
>> Hi all,
>> 
>> I would like to use character ranges in a parser as illustrated in the
>> following example (a very reduced version of my real-world grammar):
>> 
>> grammar test1;
>> foo : before '@' after;
>> before : 'a'..'z';
>> after : 'm'..'z';
>> 
>> ANTLR generates a parser that ignores the range as if the grammar were
>> 
>> grammar test2;
>> foo : before '@' after;
>> before : ;
>> after : ;
>> 
>> IOW, the grammar fails to parse the input "a at m". If I break the grammar
>> up into a lexer and a parser as in
>> 
>> grammar test3;
>> foo : BEFORE '@' AFTER;
>> BEFORE : 'a'..'z';
>> AFTER : 'm'..'z';
>> 
>> the generated code fails to parse "a at m" with a MismatchedTokeException
>> at the 'm'. This is because ANTLR silently prioritizes BEFORE even
>> though its set of characters intersects that of AFTER. Swapping BEFORE
>> and AFTER would generate a parser that fails to recognize "m at m".
>> 
>> So here are  my questions:
>> 
>> Why can't I use ranges in parsers?
>> 
>> Why doesn't ANTLR emit a warning when it ignores ranges in grammar rules?
>> 
>> How can I emulate the missing range feature without obfuscating my
>> grammar too much? Semantic predicates?
>> 
>> Now let me put my tinfoil hat on and theorize a little bit: I think that
>> the root cause of  my confusion is ANTLR's distinction between lexer and
>> parser. I think this distinction is purely historical and ANTLR might be
>> better of without it. When writing grammars, I often find myself in
>> situations where I know that certain lexer rules make sense in a certain
>> parser context only but that context is not available to the lexer
>> because the state that defines it is maintained in the parser.
>> 
>> I fondly remember my CS101 classes when we wrote recursive descent
>> parsers for LL(*) in Opal (a functional language similar to Haskell). We
>> didn't have to distinguish between lexer and parser and it felt very
>> liberating. ;-)
> 
> 
> 
>       
> ______________________________________________________________________________
> ______
> Be a better friend, newshound, and
> know-it-all with Yahoo! Mobile.  Try it now.
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
>