[antlr-interest] Another parsing question

Wed Aug 6 14:00:37 PDT 2008

Yes, but that has performance problems since a collection range tests look like
   if (foo > 1 && foo < 10 )
       <do something>
   else if (foo == 13)
       <do something else>
   else ...

and that can get very slow when you have to do all of the tests to find out that a character is not in the set.  It is possible to use a 64KB array to map characters into bytes (there is text where there are more than 256 useful characters, but that requires a mix of languages), but that layer of indirection is also undesirable.

I suspect that there is a clever solution, but I haven't thought of one.

--Loring

----- Original Message ----
> From: Gavin Lambert <antlr at mirality.co.nz>
> To: Loring Craymer <lgcraymer at yahoo.com>; Terence Parr <parrt at cs.usfca.edu>; antlr-interest at antlr.org
> Sent: Wednesday, August 6, 2008 12:58:42 PM
> Subject: Re: [antlr-interest] Another parsing question
> 
> At 02:10 7/08/2008, Loring Craymer wrote:
> >I think that much of this discussion would be moot if ANTLR 3 
> >lexers had the capabilities of ANTLR  2 lexers; unfortunately, 
> >that requires an efficient way of doing FOLLOW sets for unicode 
> >ranges--and no solution has yet presented itself for that.
> 
> Can't you just use an algorithmic test (similar to how sempreds 
> work)?  Obviously the standard table/bitset-based solution won't 
> work for Unicode (at least not without generating very large 
> bitsets [and by "very large" I mean that to represent the full 
> UTF-32 range would require a bitset taking 512MB... and that's 
> just a single follow set])  Whereas expressing the same thing in 
> code should be much more compact, since there are likely to be 
> large contiguous ranges.  And it'd have the added bonus of being 
> more readable, too.
> 
> (Of course, ANTLR might still need to hold that 512MB bitset in 
> memory while compiling the grammar, depending on how it works the 
> set out -- and possibly more than one.  But this should be less of 
> a burden than trying to do it at runtime for every rule.)