有效地找到匹配位掩码的第一个元素 [英] efficiently find the first element matching a bit mask
问题描述
我有一个列表的 N 64位整数,其位重新present小套。每个整数最多有 K 位设置为1,给定一个位掩码,我想找到的面具,即元素与放大器相匹配的列表中的第一个元素;面具==元素
。
I have a list of N 64-bit integers whose bits represent small sets. Each integer has at most k bits set to 1. Given a bit mask, I would like to find the first element in the list that matches the mask, i.e. element & mask == element
.
示例:
如果我的名单是:
index abcdef
0 001100
1 001010
2 001000
3 000100
4 000010
5 000001
6 010000
7 100000
8 000000
和我的面具是 111000
,匹配面具的第一个元素是指数2。
and my mask is 111000
, the first element matching the mask is at index 2.
方法1:
整个列表的线性搜索。这需要O( N )的时间和O(1)空间。
Linear search through the entire list. This takes O(N) time and O(1) space.
方法2:
precompute一棵树,并在每个节点保留该面具的答案。这需要O(1)时间查询,但需要O(2 ^ 64)空间。
Precompute a tree of all possible masks, and at each node keep the answer for that mask. This takes O(1) time for the query, but takes O(2^64) space.
问:
我怎样才能找到匹配的面具快于澳第一个元素( N ),而仍然使用的空间,合理的薪酬?我可以花得起多项式时间在precomputation,因为会有很多的查询。关键是, K 小。在我的应用程序,<强> K &LT; = 5和 N 是成千上万。面膜有很多1秒;可以假定它被均匀地从64位整数的空间引出。
How can I find the first element matching the mask faster than O(N), while still using a reasonable amount of space? I can afford to spend polynomial time in precomputation, because there will be a lot of queries. The key is that k is small. In my application, k <= 5 and N is in the thousands. The mask has many 1s; you can assume that it is drawn uniformly from the space of 64-bit integers.
更新:
下面是一个例子的数据集,并在Linux上运行一个简单的基准测试程序:的http:// up.thirld.com/binmask.tar.gz 。对于 large.in
, N = 3779和 K = 3。第一行是 N ,随后的 N 64位无符号整数重新presenting的元素。编译与让
。与 ./ benchmark.e&GT运行; large.out
以创建真正的输出,然后你就可以差异反对。 (面具是随机生成的,但随机种子是固定的。),然后用你的实现替换 find_first()
的功能。
Here is an example data set and a simple benchmark program that runs on Linux: http://up.thirld.com/binmask.tar.gz. For large.in
, N=3779 and k=3. The first line is N, followed by N unsigned 64-bit ints representing the elements. Compile with make
. Run with ./benchmark.e >large.out
to create the true output, which you can then diff against. (Masks are generated randomly, but the random seed is fixed.) Then replace the find_first()
function with your implementation.
简单的线性搜索的速度远远超过我的预期。这是因为 K 小,所以随机面具,找到一个匹配的速度非常快,平均
The simple linear search is much faster than I expected. This is because k is small, and so for a random mask, a match is found very quickly on average.
推荐答案
一个后缀树(位)会做的伎俩,用在叶节点原来的优先级:
A suffix tree (on bits) will do the trick, with the original priority at the leaf nodes:
000000 -> 8
1 -> 5
10 -> 4
100 -> 3
1000 -> 2
10 -> 1
100 -> 0
10000 -> 6
100000 -> 7
在这里,如果该位在掩码设置,你搜索双臂,如果没有,你只搜索0臂;你的答案是你遇到的叶节点的最低数量。
where if the bit is set in the mask, you search both arms, and if not, you search only the 0 arm; your answer is the minimum number you encounter at a leaf node.
您可以通过遍历不是为了而是由最大的鉴别力位提高本(略);在你的榜样,注意3个元素有2位的设置,所以您将创建
You can improve this (marginally) by traversing the bits not in order but by maximum discriminability; in your example, note that 3 elements have bit 2 set, so you would create
2:0 0:0 1:0 3:0 4:0 5:0 -> 8
5:1 -> 5
4:1 5:0 -> 4
3:1 4:0 5:0 -> 3
1:1 3:0 4:0 5:0 -> 6
0:1 1:0 3:0 4:0 5:0 -> 7
2:1 0:0 1:0 3:0 4:0 5:0 -> 2
4:1 5:0 -> 1
3:1 4:0 5:0 -> 0
在您的例子掩盖这不利于(因为你必须遍历两个位2 == 0和位2 == 1侧,因为你的面具是在第2位设置),但平均会提高的结果(但在设置和更复杂的数据结构的成本)。如果某些位更可能比其他人进行设置,这可能是一个巨大的胜利。如果他们是pretty的接近随机元素列表中,那么这并没有帮助的。
In your example mask this doesn't help (since you have to traverse both the bit2==0 and bit2==1 sides since your mask is set in bit 2), but on average it will improve the results (but at a cost of setup and more complex data structure). If some bits are much more likely to be set than others, this could be a huge win. If they're pretty close to random within the element list, then this doesn't help at all.
如果你坚持基本上是随机的位设置,你应该得到关于(1-5 / 64)^ 32
受惠于平均后缀树方法(13X加速),其中的也许的,由于使用更复杂的操作会比效率更差(但不要指望它 - 位掩码是快)。如果您有位在列表中的非随机分布的,那么你可以做几乎任意很好。
If you're stuck with essentially random bits set, you should get about (1-5/64)^32
benefit from the suffix tree approach on average (13x speedup), which might be better than the difference in efficiency due to using more complex operations (but don't count on it--bit masks are fast). If you have a nonrandom distribution of bits in your list, then you could do almost arbitrarily well.
这篇关于有效地找到匹配位掩码的第一个元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!