快速检查集合是否是存储集合的超集 [英] Quickly checking if set is superset of stored sets

查看:31
本文介绍了快速检查集合是否是存储集合的超集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我得到了 N 个 C 布尔值数组.我想将这些组织成一个数据结构,使我能够尽快执行以下操作:给定一个新数组,如果该数组是任何存储数组的超集",则返回 true.对于超集,我的意思是:A 是 B 的超集,如果 A[i] 对每个 i 都为真,其中 B[i] 为真.如果 B[i] 是假的,那么 A[i] 可以是任何东西.

I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.

或者,就集合而不是数组而言:

Or, in terms of sets instead of arrays:

将 N 个集合(每个集合包含 C 个可能的元素)存储到一个数据结构中,以便您可以快速查找给定的集合是否是任何存储集合的超集.

构建数据结构可以花费尽可能长的时间,但查找应该尽可能高效,并且数据结构不能占用太多空间.

Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.

我认为这本身就是一个有趣的问题,但对于我真正想解决的问题,您可以假设如下:

I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:

  • N = 10000
  • C = 1000
  • 存储的数组是稀疏的
  • 查找的数组是随机的(所以不是稀疏的)
  1. 对于 O(NC) 查找:只需迭代所有数组.不过这太慢了.

  1. For O(NC) lookup: Just iterate all the arrays. This is just too slow though.

对于 O(C) 查找:我在这里有很长的描述,但正如 Amit 在评论中指出的那样,它基本上是一个 BDD.虽然这具有很高的查找速度,但它的节点数量呈指数级增长.由于 N 和 C 如此之大,这会占用太多空间.

For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.

我希望在这个 O(N*C) 和 O(C) 解决方案之间,可能有一个不需要指数级空间的 O(log(N)*C) 解决方案.

I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.

  • 对于 O(sqrt(N)C) 查找:将数组存储为 前缀试试.查找数组 A 时,如果 A[i]=0,则转到相应的子树,如果 A[i]=1,则访问两个子树.

  • For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.

我的直觉告诉我,如果您假设存储的数组是随机的,那么这应该使查找的(平均)复杂度变为 O(sqrt(N)C).但是: 1. 他们不是,数组是稀疏的.2. 这只是直觉,我无法证明.

My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.

我将同时尝试这个新想法和 BDD 方法,看看两者中哪一个最有效.

I will try out both this new idea and the BDD method, and see which of the 2 work out best.

但与此同时,这个问题不是经常出现吗?它没有名字吗?之前不是有研究吗?感觉就像我在这里重新发明轮子一样.

But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

推荐答案

只是为了给prefix trie解决方案补充一些背景资料,最近发现了如下论文:

Just to add some background information to the prefix trie solution, recently I found the following paper:

I.Savnik:用于快速子集和超集查询的索引数据结构.CD-ARES,IFIP LNCS,2013 年.

论文提出了 set-trie 数据结构(容器),它为使用 trie 数据结构的集合集的高效存储和查询提供支持,支持查找所有超集/子集等操作集合集合中的给定集合.

The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.

对于任何对实际实现感兴趣的 python 用户,我想出了一个部分基于上述论文的 python3 包.它包含一个基于 trie 的集合容器和一个映射容器,其中键是集合.您可以在 github 上找到它.

For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.

这篇关于快速检查集合是否是存储集合的超集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆