Most alternatives to Roaring belong to a larger family of compressed bitmaps: run-length-encoded
bitmaps. They identify long runs of 1s or 0s and represent each run with a marker word.
Where there is a local mix of 1s and 0s, they use an uncompressed word.

There are many formats in this family:

* Oracle's BBC is an obsolete format at this point: though it may provide good compression,
it is likely much slower than more recent alternatives due to excessive branching.
* WAH is a patented variation on BBC that provides better performance.
* Concise is a variation on the patented WAH. In some specific instances, it can compress
much better than WAH (up to 2x better), but it is generally slower.
* EWAH is free of patents, and it is faster than all of the above. On the downside, it
does not compress quite as well. It is faster because it allows some form of "skipping"
over uncompressed words. So though none of these formats are great at random access, EWAH
is better than the alternatives.

There is, however, a big problem with these formats that can hurt you badly in some cases: there is no random access. If you want to check whether a given value is present in the set, you have to start from the beginning and "uncompress" the whole thing. This means that if you want to intersect a small set with a big set, you still have to uncompress the whole big set in the worst case...
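
The following is a minimal sketch of the run-length-encoded idea and of why it lacks random access. It is a toy format written for illustration, not BBC, WAH, Concise, or EWAH: the 32-bit word size, the tuple-based tokens, and the names `to_words`, `compress`, and `contains` are all assumptions made here. The membership test has to walk the tokens from the start, and it reports how many tokens it visited.

```python
WORD_BITS = 32                      # assumed word size for this toy format
ALL_ONES = (1 << WORD_BITS) - 1


def to_words(values, universe_size):
    """Build an uncompressed bitmap as a list of 32-bit words."""
    words = [0] * ((universe_size + WORD_BITS - 1) // WORD_BITS)
    for v in values:
        words[v // WORD_BITS] |= 1 << (v % WORD_BITS)
    return words


def compress(words):
    """Collapse runs of all-0 or all-1 words into ('fill', bit, count) markers.

    Any word with a mix of 1s and 0s is kept as ('literal', word).
    """
    tokens = []
    for w in words:
        if w in (0, ALL_ONES):
            bit = 1 if w == ALL_ONES else 0
            if tokens and tokens[-1][0] == "fill" and tokens[-1][1] == bit:
                tokens[-1] = ("fill", bit, tokens[-1][2] + 1)
            else:
                tokens.append(("fill", bit, 1))
        else:
            tokens.append(("literal", w))
    return tokens


def contains(tokens, value):
    """Membership test by scanning from the first token.

    Returns (found, tokens_visited); there is no way to jump to the
    right position without reading everything before it.
    """
    target_word, offset = divmod(value, WORD_BITS)
    position = 0                    # index of the next uncompressed word
    for visited, token in enumerate(tokens, start=1):
        if token[0] == "fill":
            _, bit, count = token
            if target_word < position + count:
                return bit == 1, visited
            position += count
        else:
            if target_word == position:
                return bool((token[1] >> offset) & 1), visited
            position += 1
    return False, len(tokens)
```

In this sketch, the cost of a lookup grows with the position of the value in the compressed stream, which is the behavior the paragraph above warns about.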

Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2<sup>16</sup> integers
(e.g., [0, 2<sup>16</sup>), [2<sup>16</sup>, 2 x 2<sup>16</sup>), ...). Within a chunk, it can use an uncompressed bitmap, a simple list of integers,
or a list of runs. Whichever format it uses, each one lets you check for the presence of any one value quickly
(e.g., with a binary search). The net result is that Roaring can compute many operations much faster than run-length-encoded
formats like WAH, EWAH, Concise... Maybe surprisingly, Roaring also generally offers better compression ratios.
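
The chunking scheme can be sketched as follows. This is a simplified illustration, not the Roaring implementation: the class name `ToyRoaring`, the tuple-based containers, and the choice to omit run containers are assumptions made here. The 4096 cutoff is the point at which a sorted list of 16-bit values would take as much space as a 2<sup>16</sup>-bit bitmap (8 KiB); treat it as illustrative.

```python
from bisect import bisect_left

CHUNK_BITS = 16
ARRAY_LIMIT = 4096  # above this, an 8 KiB bitmap is smaller than a list of 16-bit values


class ToyRoaring:
    """Toy two-level structure: high 16 bits pick a chunk, low 16 bits live in a container."""

    def __init__(self, values):
        chunks = {}
        for v in sorted(set(values)):
            chunks.setdefault(v >> CHUNK_BITS, []).append(v & 0xFFFF)
        self.keys = sorted(chunks)          # sorted chunk keys, searched by bisection
        self.containers = []
        for key in self.keys:
            lows = chunks[key]              # already sorted and deduplicated
            if len(lows) > ARRAY_LIMIT:
                # Dense chunk: uncompressed bitmap of 2^16 bits.
                bits = bytearray(1 << (CHUNK_BITS - 3))
                for low in lows:
                    bits[low >> 3] |= 1 << (low & 7)
                self.containers.append(("bitmap", bits))
            else:
                # Sparse chunk: simple sorted list of integers.
                self.containers.append(("array", lows))

    def __contains__(self, value):
        key, low = value >> CHUNK_BITS, value & 0xFFFF
        i = bisect_left(self.keys, key)     # binary search for the chunk
        if i == len(self.keys) or self.keys[i] != key:
            return False
        kind, data = self.containers[i]
        if kind == "bitmap":
            return bool(data[low >> 3] & (1 << (low & 7)))
        j = bisect_left(data, low)          # binary search inside the chunk
        return j < len(data) and data[j] == low
```

A lookup such as `70000 in ToyRoaring([1, 2, 70000])` touches only the chunk that could hold the value, so nothing before it needs to be decoded.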