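1. What is a bitmap for?
- A bitmap is a bit array, physically an Array[Byte], used to store a set of integers, Set[Integer]. It stores an integer n by setting bit n of the array to 1 (arr[n] = 1).
- Because of this representation, a bitmap can compute the intersection and union of two integer sets very quickly with the CPU's bitwise AND and bitwise OR instructions. One instruction processes a whole machine word (for example 64 bits), so combining two bitmaps of m bits takes about m/64 word operations. That is linear in the bitmap length rather than O(1), but with a very small constant.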
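Suppose there are 1 billion documents, numbered from 1 to 1 billion. How do we find the documents that contain both the word carrier and the word pigeon?
Represent the set of document numbers containing carrier as arr1: Array[Byte], and the set containing pigeon as arr2: Array[Byte]. The set of documents containing both words is the bitwise AND of these two bit arrays.
- Plain bitmaps have one drawback: when the largest value in the set is very large but the set has few elements, a huge amount of space is wasted.
For example, the set [1, 1000000000] holds only 2 integers, yet representing it takes 1 billion bits (about 125 MB).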
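The bitmap idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production structure: the helper names (`make_bitmap`, `intersect`, `members`) and the document ids are made up for the example, and a `bytearray` stands in for Array[Byte].

```python
def make_bitmap(values, size):
    """Build a plain bitmap of `size` bits with one bit set per value."""
    bits = bytearray((size + 7) // 8)
    for n in values:
        bits[n >> 3] |= 1 << (n & 7)   # byte n // 8, bit n % 8
    return bits

def intersect(a, b):
    """Bitwise AND of two equal-length bitmaps, done on big integers."""
    x = int.from_bytes(a, "little") & int.from_bytes(b, "little")
    return bytearray(x.to_bytes(len(a), "little"))

def members(bits):
    """List the integers whose bit is set."""
    return [i for i in range(len(bits) * 8) if (bits[i >> 3] >> (i & 7)) & 1]

# Hypothetical document ids containing each word.
carrier = make_bitmap([1, 4, 9, 30], size=32)
pigeon = make_bitmap([4, 7, 30, 31], size=32)
both = members(intersect(carrier, pigeon))   # [4, 30]

# The sparsity problem: two values still cost one bit per possible id.
sparse = make_bitmap([1, 1000], size=1001)   # 126 bytes for 2 integers
```

The size of `sparse` depends only on the largest value, not on how many values are stored, which is the waste Roaring bitmaps are designed to avoid.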
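2. What are Roaring bitmaps for?
Roaring bitmaps build on traditional bitmaps and use compression to solve the sparse-set problem. Specifically, a Roaring bitmap partitions a set of 32-bit integers into buckets (containers) by their high 16 bits, giving at most 2^16 = 65,536 containers. To store an integer, look up the container by the integer's high 16 bits (creating it if it does not exist), then put the integer's low 16 bits into that container. There are two common kinds of container: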
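ArrayContainer
Used when the container holds no more than 4096 values. It is essentially a sorted array of unsigned short values (exactly 16 bits each): Array[Short]. The array starts with a length of 4 and grows automatically as values are added, up to a maximum length of 4096, so an ArrayContainer takes from 4 * 2 B = 8 B initially to at most 4096 * 2 B = 8 KB. It also maintains a counter that tracks the cardinality in real time.
BitmapContainer
Used when the container holds more than 4096 values. It is essentially a traditional bitmap with a fixed length of 2^16 bits (8 KB), able to represent any of the 2^16 = 65,536 possible low-16-bit values. Physically it is a fixed-length array of 1024 unsigned long values (64 bits, 8 B each): Array[Long] (size = 1024), so these bitmaps always occupy 8 KB. It also keeps a cardinality counter. The 4096 threshold is the point where both representations cost the same 8 KB.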
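The container scheme described above can be sketched as follows. This is a toy model under stated assumptions: `RoaringSketch` is a hypothetical class, a Python `dict` stands in for the index of containers, a sorted `list` plays the ArrayContainer, and an 8192-byte `bytearray` plays the BitmapContainer. The cardinality counter is omitted for brevity.

```python
from bisect import bisect_left

ARRAY_MAX = 4096  # threshold from the text

class RoaringSketch:
    """Toy model: maps a high-16-bit key to a container."""

    def __init__(self):
        self.containers = {}

    def add(self, n):
        high, low = n >> 16, n & 0xFFFF
        # Find the container for the high bits, creating it if missing.
        c = self.containers.setdefault(high, [])
        if isinstance(c, bytearray):          # bitmap container
            c[low >> 3] |= 1 << (low & 7)
            return
        i = bisect_left(c, low)               # array container, kept sorted
        if i < len(c) and c[i] == low:
            return                            # already present
        c.insert(i, low)
        if len(c) > ARRAY_MAX:
            # More than 4096 values: switch to a 2**16-bit bitmap (8 KB).
            bitmap = bytearray(8192)
            for v in c:
                bitmap[v >> 3] |= 1 << (v & 7)
            self.containers[high] = bitmap

r = RoaringSketch()
r.add(1)
r.add(1_000_000_000)
# r.containers == {0: [1], 15258: [51712]}
```

With this layout the set [1, 1000000000] costs two tiny array containers instead of a bitmap of 1 billion bits.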
3. How do Roaring bitmaps compute exist, union, and intersect?
- Checking whether an integer N is in the set
To check if an integer N exists, get N's 16 most significant bits (N / 2^16) and use it to find N's corresponding container in the Roaring bitmap.
If the container doesn't exist, then N is not in the Roaring bitmap.
Checking for existence in array and bitmap containers works differently:
Bitmap: check if the bit at N % 2^16 is set.
Array: use binary search to find N % 2^16 in the sorted array.
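The existence check above might look like this in Python. It is a sketch only: `contains` is a hypothetical helper, and it assumes the same toy layout as before (a `dict` from high-16-bit key to either a sorted `list` or an 8192-byte `bytearray`).

```python
from bisect import bisect_left

def contains(containers, n):
    """Return True if integer n is stored in the toy Roaring layout."""
    high, low = n >> 16, n & 0xFFFF        # N / 2^16 and N % 2^16
    c = containers.get(high)
    if c is None:
        return False                       # no container, so N is absent
    if isinstance(c, bytearray):           # bitmap: test the bit at `low`
        return bool((c[low >> 3] >> (low & 7)) & 1)
    i = bisect_left(c, low)                # array: binary search
    return i < len(c) and c[i] == low

bitmap = bytearray(8192)
bitmap[0] |= 1 << 3                        # set bit 3 in chunk 1
containers = {0: [1, 5, 9], 1: bitmap}
```

Here `contains(containers, 5)` and `contains(containers, 65539)` are true (65539 is chunk 1, low value 3), while `contains(containers, 6)` and any value in a chunk without a container are false.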
- Computing the intersection
Intersect matching containers to intersect two Roaring bitmaps. Algorithms vary by container type(s), and container types may change.
To intersect Roaring bitmaps A and B, it is sufficient to intersect matching containers in A and B.
This is possible because of how integers are partitioned in Roaring bitmaps: matching containers in A and B store integers with the same 16 most significant bits (the same chunks).
Intersection algorithms vary by the types of the containers involved, as do the resulting container types:
Bitmap / Bitmap: Compute the bitwise AND of the two bitmaps. If the cardinality is <= 4,096, store the result in an array container, otherwise store it in a bitmap container.
Bitmap / Array: Iterate over the array, checking for the existence of each 16-bit integer in the bitmap. If the integer exists, add it to the resulting array container – note that intersections of bitmap and array container types will always create an array container.
Array / Array: Intersections of two array containers always create a new array container. The algorithm used to compute the intersection varies by a cardinality heuristic described at the bottom of page 5 here. It will either be a simple merge (as used in merge sort) or a galloping intersection, described in this paper.
If there is a container in either Roaring bitmap without a corresponding container in the other, it will not exist in the result: the intersection of an empty set and any set is an empty set.
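The three intersection cases can be sketched as below, again on the assumed toy layout (a `dict` of sorted `list` or 8192-byte `bytearray` containers). All function names are hypothetical. For the array/array case only the simple merge is shown; the galloping variant and the heuristic that picks between them are left out.

```python
ARRAY_MAX = 4096

def to_bitmap(values):
    """Build a 2**16-bit bitmap container from low-16-bit values."""
    bm = bytearray(8192)
    for v in values:
        bm[v >> 3] |= 1 << (v & 7)
    return bm

def to_array(bm):
    """List the set bits of a bitmap container in ascending order."""
    return [v for v in range(1 << 16) if (bm[v >> 3] >> (v & 7)) & 1]

def intersect_containers(a, b):
    a_bm, b_bm = isinstance(a, bytearray), isinstance(b, bytearray)
    if a_bm and b_bm:                       # bitmap / bitmap: bitwise AND
        anded = bytearray(x & y for x, y in zip(a, b))
        card = sum(bin(x).count("1") for x in anded)
        return anded if card > ARRAY_MAX else to_array(anded)
    if a_bm or b_bm:                        # bitmap / array: probe each value
        bm, arr = (a, b) if a_bm else (b, a)
        return [v for v in arr if (bm[v >> 3] >> (v & 7)) & 1]
    out, i, j = [], 0, 0                    # array / array: simple merge
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect(A, B):
    """Intersect two toy Roaring bitmaps; unmatched containers vanish."""
    result = {}
    for key in A.keys() & B.keys():
        c = intersect_containers(A[key], B[key])
        if len(c):                          # assumption: drop empty containers
            result[key] = c
    return result

both = intersect({0: [1, 2, 3], 1: [5]}, {0: [2, 3, 4], 2: [5]})
# both == {0: [2, 3]}
```

Chunks 1 and 2 each appear in only one input, so neither survives in the result.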
- Computing the union
Union matching containers to produce a Roaring bitmap union. Algorithms vary by container type(s), and container types may change.
To union Roaring bitmaps A and B, union all matching containers in A and B.
Union algorithms vary by the container types involved, as do the resulting container types:
Bitmap / Bitmap: Compute the bitwise OR of the two bitmaps. Unions of two bitmap containers will always create another bitmap container.
Bitmap / Array: Copy the bitmap and set corresponding bits for all the integers in the array container. Unions of a bitmap and array container will always create another bitmap container.
Array / Array: If the sum of the cardinalities of the two array containers is <= 4,096, the resulting container will be an array container. In this case, add all integers from both arrays to a new array container. Otherwise, optimistically assume the resulting container will be a bitmap: create a new bitmap container and set all corresponding bits for all integers in both arrays. If the cardinality of the resulting container is <= 4,096, convert the bitmap container back into an array container.
Finally, add all containers in A and B that do not have a matching container to the result. Remember: this is a union, so all integers in Roaring bitmaps A and B must be in the resulting set.
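A matching sketch for the union rules, under the same assumptions as before (toy `dict` layout, sorted `list` as array container, 8192-byte `bytearray` as bitmap container, hypothetical function names). The small array/array case uses `sorted(set(...))` as shorthand for a merge of two sorted arrays.

```python
ARRAY_MAX = 4096

def set_bits(bm, values):
    """Set the bit for each value in bitmap container bm, then return bm."""
    for v in values:
        bm[v >> 3] |= 1 << (v & 7)
    return bm

def union_containers(a, b):
    a_bm, b_bm = isinstance(a, bytearray), isinstance(b, bytearray)
    if a_bm and b_bm:                        # bitmap / bitmap: bitwise OR
        return bytearray(x | y for x, y in zip(a, b))
    if a_bm or b_bm:                         # bitmap / array: copy, set bits
        bm, arr = (a, b) if a_bm else (b, a)
        return set_bits(bytearray(bm), arr)
    if len(a) + len(b) <= ARRAY_MAX:         # array / array, small enough
        return sorted(set(a) | set(b))
    # Optimistically build a bitmap, then fall back if it turned out small.
    bm = set_bits(set_bits(bytearray(8192), a), b)
    card = sum(bin(x).count("1") for x in bm)
    if card <= ARRAY_MAX:
        return [v for v in range(1 << 16) if (bm[v >> 3] >> (v & 7)) & 1]
    return bm

def union(A, B):
    """Union two toy Roaring bitmaps, keeping unmatched containers."""
    result = {}
    for key in A.keys() | B.keys():
        if key in A and key in B:
            result[key] = union_containers(A[key], B[key])
        else:
            c = A[key] if key in A else B[key]
            result[key] = c[:]               # copy the unmatched container
    return result

merged = union({0: [1, 3], 1: [7]}, {0: [2, 3], 2: [9]})
# merged == {0: [1, 2, 3], 1: [7], 2: [9]}
```

Unlike the intersection, chunks 1 and 2 are carried over even though each exists in only one input.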