[antlr-interest] Another parsing question

Thu Aug 7 05:34:48 PDT 2008

At 09:32 7/08/2008, Randall R Schulz wrote:
 > On Wednesday 06 August 2008 14:00, Loring Craymer wrote:
 >> and that can get very slow when you have to do all of
 >> the tests to find out that a character is not in the set.

I did say it'd likely be slower, but I doubt it'd be significantly 
so.  (Looking for presence/absence in a BitSet is an O(1) 
operation, but so is this kind of conditional tree -- and at worst 
case it's only O(n).)

Besides which, small sets could be rendered as switch blocks or a 
simple logical-or expression, which are probably even faster than 
using a BitSet.

 >Prove it.

Indeed.

 >Anyway, why not just use a regular hashed set? The Trove4J 
library
 >has native collections, too.

Some kind of sparse boolean collection would work well (or even 
something more fancy, if you want to indicate the possible 
following rules at end-of-token as well).

But you'll also need some way of specifying whether it's an 
"include" collection or an "exclude" one, so that you only need to 
populate the smaller of the two.  In other words, ~('a' | 'b') 
should be recorded as an "exclude" set of 'a' and 'b', not as an 
"include" set of every character except for 'a' and 'b' (as ANTLR 
v2 did) :)

Although I still think just generating the matching code will end 
up simpler than futzing around with collections.  (The compiler 
itself will need a collection, of course.  But I don't see why the 
runtime lexer would.)