查找包含其他集合中至少一个元素的集合 [英] Find sets that contain at least one element from other sets
问题描述
假设给定了 n 个集合,并希望构造所有最小集,每个输入集至少具有一个共同点。如果没有作为 S 子集的允许集合 S',则 S 集称为极小值。
Suppose we are given n sets and want to construct all minimal sets that have at least one element in common with each of the input sets. A set S is called minimal, if there is no admissible set S' that is a subset of S.
一个例子:
In: s1 = {1, 2, 3}; s2 = {3, 4, 5}; s3 = {5, 6}
Out: [{1, 4, 6}, {1, 5}, {2, 4, 6}, {2, 5}, {3, 5}, {3, 6}]
我的想法是将一组之后的另一组迭代添加解决方案:
My idea was to iteratively add one set after the other to the solution:
result = f(s1, f(s2, f(s3, ...)))
其中, f
是一个合并函数,看起来可能如下:
whereby f
is a merge function that could look as follows:
function f(newSet, setOfSets):
Step 1:
return all elements of setOfSets that share an element with newSet
Step 2:
for each remaining element setE of setOfSets:
for each element e of newSet:
return union(setE, {e})
上述方法的问题在于,在步骤2中计算出的笛卡尔积可能包含集合的超集在步骤1中返回。我正在考虑遍历所有已返回的集合(请参阅查找覆盖给定集合的最小子集),但这似乎过于复杂且效率低下,我希望在我的特殊情况下有更好的解决方案。
The issue with the above appraoch is that the cartesian product computed in step 2 may contain supersets of sets returned in step 1. I was thinking of going through all already returned sets (see Find minimal set of subsets that covers a given set), but this seems to be too complicated and inefficient, and I hope that there is a better solution in my special case.
在不确定第2步中完整笛卡尔积的情况下如何实现目标?
请注意,此问题与仅找到最小集合的问题,但是我需要找到以上述方式最小的所有集合。我知道解决方案的数量将不是多项式。
Note that this question is related to the question of finding the smallest set only, but I need to find all sets that are minimal in the way specified above. I am aware that the number of solutions will not be polynomial.
输入集的数量 n 将是多个Hundret,但是这些集合包含只有范围有限的元素(例如大约20个不同的值),这也限制了集合的大小。如果算法在 O(n ^ 2)中运行,这是可以接受的,但是它应该基本上与输出集呈线性关系(可能具有对数乘数)。
The number n of input sets will be several hundret, but the sets contain only elements from a limited range (e.g. about 20 different values), which also limits the sets' sizes. It would be acceptible if the algorithm runs in O(n^2), but it should be basically linear (maybe with a log multiplier) of the output sets.
推荐答案
我已经提出了一个基于trie数据结构的解决方案,如。通过尝试,可以相对快速地确定其中一个存储集是否是另一个给定集合的子集( Savnik,2013 )。
I have come up with a solution based on the trie data structure as described here. Tries make it relatively fast to determine whether one of the stored sets is a subset of another given set (Savnik, 2013).
然后解决方案如下:
- 创建特里
- 遍历给定的集合
- 在每次迭代中,遍历Trie中的集合并检查它们是否与新集。
- 如果是,请继续;如果不是,则将相应的新集合添加到trie中,除非它们是trie中集合的超集。
最坏情况下的运行时是 O(nmc),如果仅考虑 n'< =,则 m 是解决方案的最大数量输入集中的n 个,而 c 是子集查找的时间因子。
The worst-case runtime is O(n m c), whereby m is the maximal number of solutions if we consider only n' <= n of the input sets, and c is the time factor from the subset lookups.
代码如下。我已经基于python包 datrie 实现了该算法,该包是对C的有效C实现的包装一个特里。下面的代码在cython中,但是可以通过删除/交换cython特定命令轻松地转换为纯python。
The code is below. I have implemented the algorithm based on the python package datrie, which is a wrapper around an efficent C implementation of a trie. The code below is in cython but can be converted to pure python easily by removing/exchangin cython specific commands.
扩展的trie实现:
from datrie cimport BaseTrie, BaseState cdef short has_subset_c(BaseTrie trie, BaseState trieState, str setarr, int index, int size): cdef BaseState trieState2 = BaseState(trie) cdef int i trieState.copy_to(trieState2) for i in range(index, size): if trieState2.walk(setarr[i]): if trieState2.is_leaf() or has_subset_c(trie, trieState2, setarr, i, size): return True trieState.copy_to(trieState2) return False cdef class SetTrie(): def __init__(self, alphabet, initSet=[]): if not hasattr(alphabet, "__iter__"): alphabet = range(alphabet) self.trie = BaseTrie("".join(chr(i) for i in alphabet)) for i in initSet: self.trie[chr(i)] = 0 def has_subset(self, subset): cdef BaseState trieState = BaseState(self.trie) setarr = [chr(i) for i in subset] return bool(has_subset_c(self.trie, trieState, setarr, 0, len(subset))) def extend(self, sets): for s in sets: self.trie["".join(chr(i) for i in s)] = 0 def update(self, otherSet): otherSet = set(chr(i) for i in otherSet) cdef str subset, newSubset, elem cdef list disjointList = [] cdef BaseTrie trie = self.trie for subset in trie: if otherSet.isdisjoint(subset): disjointList.append(subset) cdef BaseState trieState = BaseState(self.trie) for subset in disjointList: trie.pop(subset) for elem in otherSet: newSubset = subset + elem trieState.rewind() if not has_subset_c(self.trie, trieState, newSubset, 0, len(newSubset)): trie[newSubset] = 0 def get_frozensets(self): return (frozenset(ord(t) for t in subset) for subset in self.trie)
可以按以下方式使用:
def cover_sets(sets): strie = SetTrie(range(10), *([i] for i in sets[0])) for s in sets[1:]: strie.update(s) return strie.get_frozensets()
定时:
from timeit import timeit s1 = {1, 2, 3} s2 = {3, 4, 5} s3 = {5, 6} %timeit cover_sets([s1, s2, s3])
结果:
37.8 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
这篇关于查找包含其他集合中至少一个元素的集合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!