快速检查集合是否是存储集合的超集 [英] Quickly checking if set is superset of stored sets
问题描述
我得到了 N 个 C 布尔值数组.我想将这些组织成一个数据结构,使我能够尽快执行以下操作:给定一个新数组,如果该数组是任何存储数组的超集",则返回 true.对于超集,我的意思是:A 是 B 的超集,如果 A[i] 对每个 i 都为真,其中 B[i] 为真.如果 B[i] 是假的,那么 A[i] 可以是任何东西.
I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
或者,就集合而不是数组而言:
Or, in terms of sets instead of arrays:
将 N 个集合(每个集合包含 C 个可能的元素)存储到一个数据结构中,以便您可以快速查找给定的集合是否是任何存储集合的超集.
构建数据结构可以花费尽可能长的时间,但查找应该尽可能高效,并且数据结构不能占用太多空间.
Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.
我认为这本身就是一个有趣的问题,但对于我真正想解决的问题,您可以假设如下:
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
- N = 10000
- C = 1000
- 存储的数组是稀疏的
- 查找的数组是随机的(所以不是稀疏的)
对于 O(NC) 查找:只需迭代所有数组.不过这太慢了.
For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
对于 O(C) 查找:我在这里有很长的描述,但正如 Amit 在评论中指出的那样,它基本上是一个 BDD.虽然这具有很高的查找速度,但它的节点数量呈指数级增长.由于 N 和 C 如此之大,这会占用太多空间.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
我希望在这个 O(N*C) 和 O(C) 解决方案之间,可能有一个不需要指数级空间的 O(log(N)*C) 解决方案.
I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.
对于 O(sqrt(N)C) 查找:将数组存储为 前缀试试.查找数组 A 时,如果 A[i]=0,则转到相应的子树,如果 A[i]=1,则访问两个子树.
For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.
我的直觉告诉我,如果您假设存储的数组是随机的,那么这应该使查找的(平均)复杂度变为 O(sqrt(N)C).但是: 1. 他们不是,数组是稀疏的.2. 这只是直觉,我无法证明.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.
我将同时尝试这个新想法和 BDD 方法,看看两者中哪一个最有效.
I will try out both this new idea and the BDD method, and see which of the 2 work out best.
但与此同时,这个问题不是经常出现吗?它没有名字吗?之前不是有研究吗?感觉就像我在这里重新发明轮子一样.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
推荐答案
只是为了给prefix trie解决方案补充一些背景资料,最近发现了如下论文:
Just to add some background information to the prefix trie solution, recently I found the following paper:
I.Savnik:用于快速子集和超集查询的索引数据结构.CD-ARES,IFIP LNCS,2013 年.
论文提出了 set-trie 数据结构(容器),它为使用 trie 数据结构的集合集的高效存储和查询提供支持,支持查找所有超集/子集等操作集合集合中的给定集合.
The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.
对于任何对实际实现感兴趣的 python 用户,我想出了一个部分基于上述论文的 python3 包.它包含一个基于 trie 的集合容器和一个映射容器,其中键是集合.您可以在 github 上找到它.
For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.
这篇关于快速检查集合是否是存储集合的超集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!