Commit c39a223

Update README.md
1 parent cd6117d commit c39a223

1 file changed

+87
-0
lines changed

README.md

Lines changed: 87 additions & 0 deletions
@@ -55,6 +55,93 @@ This code is licensed under Apache License, Version 2.0 (ASL2.0).
Copyright 2016-... by the authors.

When should you use a bitmap?
===================================

Sets are a fundamental abstraction in
software. They can be implemented in various
ways, as hash sets, as trees, and so forth.
In databases and search engines, sets are often an integral
part of indexes. For example, we may need to maintain a set
of all documents or rows (represented by numerical identifiers)
that satisfy some property. Besides adding or removing
elements from the set, we need fast functions
to compute the intersection, the union, the difference between sets, and so on.

To implement a set
of integers, a particularly appealing strategy is the
bitmap (also called bitset or bit vector). Using n bits,
we can represent any set made of the integers from the range
[0,n): the ith bit is set to one exactly when integer i is present in the set.
Commodity processors use words of W=32 or W=64 bits. By combining many such words, we can
support large values of n. Intersections, unions and differences can then be implemented
as bitwise AND, OR and ANDNOT operations.
More complicated set functions can also be implemented as bitwise operations.

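As a rough illustration of the paragraph above, the sketch below uses a Python integer as a stand-in for a bitset (Python integers have arbitrary precision, so bit i of the integer plays the role of the ith bit). The helper names `to_bitmap` and `to_sorted_list` are hypothetical and exist only for this example; a real implementation would use an array of machine words.

```python
def to_bitmap(values):
    """Build a bitmap (stored in a Python int) with bit i set for each i in values."""
    bitmap = 0
    for i in values:
        bitmap |= 1 << i
    return bitmap


def to_sorted_list(bitmap):
    """Return the positions of the set bits, in increasing order."""
    result = []
    position = 0
    while bitmap:
        if bitmap & 1:
            result.append(position)
        bitmap >>= 1
        position += 1
    return result


a = to_bitmap([1, 2, 3, 100])
b = to_bitmap([2, 3, 4, 200])

intersection = a & b   # bitwise AND
union = a | b          # bitwise OR
difference = a & ~b    # bitwise ANDNOT: elements of a that are not in b

print(to_sorted_list(intersection))  # [2, 3]
print(to_sorted_list(union))         # [1, 2, 3, 4, 100, 200]
print(to_sorted_list(difference))    # [1, 100]
```
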
When the bitset approach is applicable, it can be orders of
magnitude faster than other possible implementations of a set (e.g., as a hash set)
while using several times less memory.

However, a bitset, even a compressed one, is not always applicable. For example, if
you have 1000 random-looking integers, then a simple array might be the best representation.
We refer to this case as the "sparse" scenario.

When should you use compressed bitmaps?
===================================

An uncompressed BitSet can use a lot of memory. For example, if you take a BitSet
and set the bit at position 1,000,000 to true, you have just over 100kB. That is over 100kB
to store the position of one bit. This is wasteful even if you do not care about memory:
suppose that you need to compute the intersection between this BitSet and another one
that has the bit at position 1,000,001 set to true; then you need to go through all these zeroes,
whether you like it or not. That can become very wasteful.

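The arithmetic behind the "just over 100kB" figure can be sketched as follows. This assumes the bitset is stored as a flat array of 64-bit words with no other overhead, which is an assumption of this example rather than a statement about any particular BitSet implementation; `uncompressed_bytes` is a hypothetical helper.

```python
WORD_BITS = 64  # assumption: the bitset is a flat array of 64-bit words


def uncompressed_bytes(highest_bit):
    """Bytes needed by an uncompressed bitset whose highest set bit is highest_bit."""
    # Positions 0..highest_bit must all be representable, rounded up to whole words.
    words = (highest_bit + WORD_BITS) // WORD_BITS
    return words * (WORD_BITS // 8)


size = uncompressed_bytes(1_000_000)
print(size)       # 125008 bytes for a single set bit
print(size // 8)  # 15626 words to scan when intersecting with a similar bitset
```
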
This being said, there are definitely cases where attempting to use compressed bitmaps is wasteful.
For example, you may have a small universe size: your bitmaps represent sets of integers
from [0,n) where n is small (e.g., n=64 or n=128). If you are able to use an uncompressed BitSet and
it does not blow up your memory usage, then compressed bitmaps are probably not useful
to you. In fact, if you do not need compression, then a BitSet offers remarkable speed.

The sparse scenario is another use case where compressed bitmaps should not be used.
Keep in mind that random-looking data is usually not compressible. E.g., if you have a small set of
32-bit random integers, it is not mathematically possible to use far less than 32 bits per integer,
and attempts at compression can be counterproductive.

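One way to make the incompressibility claim concrete is a counting argument: to distinguish every possible set of k values drawn from a universe of N values, any encoding needs at least log2(C(N, k)) bits in total. The sketch below computes that bound per integer. It assumes all such sets are equally likely, and `min_bits_per_integer` is a hypothetical name used only here.

```python
import math


def min_bits_per_integer(universe_size, count):
    """Lower bound on bits per element for an arbitrary set of `count` distinct values."""
    number_of_sets = math.comb(universe_size, count)  # how many distinct sets exist
    return math.log2(number_of_sets) / count


bound = min_bits_per_integer(2**32, 1000)
print(round(bound, 1))
```

For 1000 random 32-bit integers, the bound lands between 23 and 24 bits per integer, so even an ideal encoder saves only a modest fraction over a plain array of 32-bit values.
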
How does Roaring compare with the alternatives?
==================================================

Most alternatives to Roaring are part of a larger family of compressed bitmaps that are run-length-encoded
bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word.
If you have a local mix of 1s and 0s, you use an uncompressed word.

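The general idea can be sketched with a toy encoder. This is not the actual layout of BBC, WAH, Concise or EWAH: the tuple-based representation and the names `rle_encode` and `rle_decode` are assumptions made for illustration. Runs of all-zero or all-one words collapse into a single "fill" entry, and mixed words are kept as literals.

```python
WORD_BITS = 64
ALL_ONES = (1 << WORD_BITS) - 1


def rle_encode(words):
    """Encode 64-bit words as ('fill', bit, run_length) or ('literal', word) entries."""
    encoded = []
    for word in words:
        if word == 0 or word == ALL_ONES:
            bit = 0 if word == 0 else 1
            last = encoded[-1] if encoded else None
            if last is not None and last[0] == "fill" and last[1] == bit:
                encoded[-1] = ("fill", bit, last[2] + 1)  # extend the current run
            else:
                encoded.append(("fill", bit, 1))  # start a new run
        else:
            encoded.append(("literal", word))  # local mix of 1s and 0s
    return encoded


def rle_decode(encoded):
    """Expand the entries back into the original list of words."""
    words = []
    for entry in encoded:
        if entry[0] == "fill":
            words.extend([ALL_ONES if entry[1] else 0] * entry[2])
        else:
            words.append(entry[1])
    return words


words = [0] * 1000 + [0b1011] + [ALL_ONES] * 500 + [0]
encoded = rle_encode(words)
print(encoded)
# [('fill', 0, 1000), ('literal', 11), ('fill', 1, 500), ('fill', 0, 1)]
```

Here 1502 words shrink to four entries, but finding out whether a particular bit is set requires walking the entries from the start.
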
There are many formats in this family:

* Oracle's BBC is an obsolete format at this point: though it may provide good compression,
it is likely much slower than more recent alternatives due to excessive branching.
* WAH is a patented variation on BBC that provides better performance.
* Concise is a variation on the patented WAH. In some specific instances, it can compress
much better than WAH (up to 2x better), but it is generally slower.
* EWAH is free of patents and faster than all of the above. On the downside, it
does not compress quite as well. It is faster because it allows some form of "skipping"
over uncompressed words. So though none of these formats are great at random access, EWAH
is better than the alternatives.

However, there is a big problem with these formats that can hurt you badly in some cases: there is no random access. If you want to check whether a given value is present in the set, you have to start from the beginning and "uncompress" the whole thing. This means that if you want to intersect a small set with a large set, you still have to uncompress the whole large set in the worst case.

Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2<sup>16</sup> integers
(e.g., [0, 2<sup>16</sup>), [2<sup>16</sup>, 2 x 2<sup>16</sup>), ...). Within a chunk, it can use an uncompressed bitmap, a simple list of integers,
or a list of runs. Whatever format it uses, they all allow you to check for the presence of any one value quickly
(e.g., with a binary search). The net result is that Roaring can compute many operations much faster than run-length-encoded
formats like WAH, EWAH, Concise... Maybe surprisingly, Roaring also generally offers better compression ratios.

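The chunking idea can be sketched as follows. This is a deliberately simplified toy, not the library's implementation: it stores every chunk as a sorted list of 16-bit values and leaves out the bitmap and run representations mentioned above. The names `build_chunks`, `contains` and `intersect` are hypothetical.

```python
from bisect import bisect_left

CHUNK_BITS = 16  # each chunk covers 2**16 consecutive integers


def build_chunks(values):
    """Group values by their high bits; each chunk keeps its sorted low 16 bits."""
    chunks = {}
    for value in sorted(set(values)):
        key = value >> CHUNK_BITS
        chunks.setdefault(key, []).append(value & 0xFFFF)
    return chunks


def contains(chunks, value):
    """Random access: locate the chunk, then binary search inside it."""
    container = chunks.get(value >> CHUNK_BITS)
    if container is None:
        return False
    low = value & 0xFFFF
    index = bisect_left(container, low)
    return index < len(container) and container[index] == low


def intersect(a, b):
    """Intersect chunk by chunk, skipping chunks present in only one input."""
    result = {}
    for key in a.keys() & b.keys():
        b_lows = set(b[key])
        common = [low for low in a[key] if low in b_lows]
        if common:
            result[key] = common
    return result


small = build_chunks([5, 70_000, 1_000_000])
large = build_chunks(range(0, 2_000_000, 5))

print(contains(large, 1_000_000))  # True
print(contains(large, 1_000_001))  # False
print(intersect(small, large) == {0: [5], 1: [4464], 15: [16960]})  # True
```

Intersecting the small set with the large one touches only the three chunks that both inputs share, rather than scanning the whole large set.
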

### References
