[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 08:08:34 PDT 2008

I think that Gavin answered the range question pretty well.  As to the separation of lexer and parser, that is not an historical artifact.  Had you tried to use your CS 101 LL1 parser for production use, you would have been very disappointed with both its performance and memory footprint.  Lexers work with characters; parsers work with tokens.  If your characters are packaged as tokens, memory usage expands dramatically and it takes longer to dereference the token to look at the character.  That can easily cost you a factor of 10-100 in performance; ANTLR 2 just packaged characters into strings, and that cost a factor of 5 over more optimal approaches (ANTLR 3 does much better).  The impact of using a single stage of analysis on the parser is equally dramatic:  "foo is blue" goes from a 3 token analysis problem ("foo" "is" "blue") to an 11 token analysis problem (can you say 'combinatorial explosion') with a corresponding dramatic decrease in performance.

As to the  "before '@' after" example you give (assuming that you really do need to distinguish BEFORE and AFTER tokens), the best approach might be to support multiple token emission from lexer rules so  that you do something like

AT :  'a' .. 'l' { emit(BEFORE); } '@' { emit(AT); } 'm'..'z' {emit(AFTER); } ;

foo : BEFORE AT AFTER

but ANTLR 3 does not quite do that yet.

Note that the natural approach to reading text is in fact to lex characters into tokens which we call "words" (reading text without spaces requires some effort) and then to parse sentences made of words.  Spaces and punctuation make a big difference because they support hierarchical processing.

--Loring

----- Original Message ----
> From: Hannes Schmidt <antlr5 at hannesschmidt.net>
> To: antlr-interest at antlr.org
> Sent: Tuesday, April 22, 2008 7:16:16 PM
> Subject: [antlr-interest] Why don't parsers support character ranges?
> 
> Hi all,
> 
> I would like to use character ranges in a parser as illustrated in the 
> following example (a very reduced version of my real-world grammar):
> 
> grammar test1;
> foo : before '@' after;
> before : 'a'..'z';
> after : 'm'..'z';
> 
> ANTLR generates a parser that ignores the range as if the grammar were
> 
> grammar test2;
> foo : before '@' after;
> before : ;
> after : ;
> 
> IOW, the grammar fails to parse the input "a at m". If I break the grammar 
> up into a lexer and a parser as in
> 
> grammar test3;
> foo : BEFORE '@' AFTER;
> BEFORE : 'a'..'z';
> AFTER : 'm'..'z';
> 
> the generated code fails to parse "a at m" with a MismatchedTokeException 
> at the 'm'. This is because ANTLR silently prioritizes BEFORE even 
> though its set of characters intersects that of AFTER. Swapping BEFORE 
> and AFTER would generate a parser that fails to recognize "m at m".
> 
> So here are  my questions:
> 
> Why can't I use ranges in parsers?
> 
> Why doesn't ANTLR emit a warning when it ignores ranges in grammar rules?
> 
> How can I emulate the missing range feature without obfuscating my 
> grammar too much? Semantic predicates?
> 
> Now let me put my tinfoil hat on and theorize a little bit: I think that 
> the root cause of  my confusion is ANTLR's distinction between lexer and 
> parser. I think this distinction is purely historical and ANTLR might be 
> better of without it. When writing grammars, I often find myself in 
> situations where I know that certain lexer rules make sense in a certain 
> parser context only but that context is not available to the lexer 
> because the state that defines it is maintained in the parser.
> 
> I fondly remember my CS101 classes when we wrote recursive descent 
> parsers for LL(*) in Opal (a functional language similar to Haskell). We 
> didn't have to distinguish between lexer and parser and it felt very 
> liberating. ;-)

      ____________________________________________________________________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ