查找所有最大子集的高效算法 [英] Efficient algorithm for finding all maximal subsets

查看:13
本文介绍了查找所有最大子集的高效算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组独特的集合(表示为位掩码),并希望消除作为另一个元素的真子集的所有元素.例如:

I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:

input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}]
output = [{1, 2, 3}, {2, 4}]

我无法为此找到标准算法,甚至找不到此问题的名称,因此我将其称为最大子集",因为没有其他任何内容.这是一个 O(n^2) 算法(具体用 Python),假设 is_subset_func 为 O(1):1

I have not been able to find a standard algorithm for this, or even a name for this problem, so I am calling it "maximal subsets" for lack of anything else. Here is an O(n^2) algorithm (in Python for concreteness), assuming is_subset_func is O(1):1

def eliminate_subsets(a, cardinality_func, is_subset_func):
    out = []
    for element in sorted(a, reverse=True, key=cardinality_func):
        for existing in out:
            if is_subset_func(element, existing):
                break
        else:
            out.append(element)
    return out

是否有更高效的算法,希望 O(n log n) 或更好?

Is there a more efficient algorithm, hopefully O(n log n) or better?

1 对于固定大小的位掩码,在我的情况下也是如此,is_subset_func 只是 element &现有的 == 元素,它以恒定的时间运行.

1 For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.

推荐答案

假设您标记所有输入集.

Suppose you label all the input sets.

A={1, 2, 3}, B={1, 2}, C={2, 3}, D={2, 4}, E={}

现在构建中间集,宇宙中的每个元素一个,包含它出现的集合的标签:

Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

1={A,B}
2={A,B,C,D}
3={A,C}
4={D}

现在为每个输入集计算其元素的所有标签集的交集:

Now for each input set compute the intersection of all the label sets of its elements:

For A, {A,B} intesect {A,B,C,D} intersect {A,C} = {A}   (*)

如果交集包含集合本身以外的标签,则它是该集合的子集.这里没有其他元素,所以答案是否定的.但是,

If the intersection contains some label other than for the set itself, then it's s a subset of that set. Here there is no other element, so the answer is no. But,

For C, {A,B,C,D} intersect {A,C} = {A,C}, which means that it's a subset of A.

此方法的成本取决于集合的实现.假设位图(如您所暗示的).假设有 n 个最大大小为 m 和 |U| 的输入集宇宙中的物品.然后中间集构造产生|U|大小为 n 位的集合,因此有 O(|U|n) 时间来初始化它们.设置这些位需要 O(nm) 时间.计算上面 (*) 处的每个交点需要 O(mn);全部为 O(mn^2).

The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m and |U| items in the universe. Then the intermediate set construction produces |U| sets of size n bits, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all.

将所有这些放在一起,我们有 O(|U|n) + O(nm) +O(mn^2) = O(|U|n + mn^2).使用相同的约定,您的所有对"算法为 O(|U|^2 n^2).由于 m <= |U|,该算法渐近地更快.在实践中它也可能更快,因为没有复杂的簿记来添加常数因子.

Putting all these together we have O(|U|n) + O(nm) +O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well because there's no elaborate bookkeeping to add constant factors.

补充:在线版

OP 询问是否有该算法的在线版本,即当输入集一个接一个到达时,最大集合的集合可以增量维护.答案似乎是肯定的.中间集快速告诉我们一个新集是否是一个已经见过的子集.但是如何快速判断它是否是超集?如果是这样,哪些现有的最大集合?因为在这种情况下,那些最大集合不再是最大的,必须用新的集合来代替.

The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.

关键是要注意 AB 的超集 iff A'B' 的子集code>(勾号'表示集合补码).

The key is to note that A is a superset of B iff A' is a subset of B' (the tick' denoting set complement).

根据这个灵感,我们像以前一样保持中间集.当一个新的输入集 S 到达时,进行与上述相同的测试:令 I(e) 为输入元素 e 的中间集.那么这个测试是

Following this inspiration, we maintain the intermediate set as before. When a new input set S arrives, do the same test as described above: Let I(e) be the intermediate set for input element e. Then this test is

For X = intersect_{e in S} . I(e), |X| > 0

(在这种情况下,它大于零而不是上面的一,因为 S 尚未在 I 中.)如果测试成功,则新集合是(可能不正确的)现有最大集合的子集,因此可以将其丢弃.

(In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded.

否则我们必须添加 S 作为新的最大集,但在此之前,计算:

Otherwise we must add S as a new maximal set, but before doing this, compute:

Y = intersect_{e in S'} . I'(e) = ( union_{e in S'} . I(e) )'

再次将勾号设置为补码.联合形式的计算速度可能会快一些.Y 包含已被 S 取代的最大集合.它们必须从最大集合和 I 中删除.最后添加 S 作为最大集合并用 S 的元素更新 I.

where again the tick' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superceded by S. They must be removed from the maximal collection and from I. Finally add S as a maximal set and update I with S's elements.

让我们来看看我们的例子.当 A 到达时,我们将它添加到 I 并拥有

Let's work through our example. When A arrives, we add it to I and have

1={A}  2={A}  3={A}

B到达时,我们发现X = {A} intersect {A} = {A},所以把B扔掉继续.C 也是如此.当 D 到达时,我们发现 X = {A} intersect {} = {},所以继续 Y = I'(1) intersect I'(3)= {} 相交 {}.这正确地告诉我们最大集合 A 不包含在 D 中,因此没有什么可删除的.但它必须作为一个新的最大集合添加,并且 I 变成了

When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But it must be added as a new maximal set, and I becomes

1={A}  2={A,D}  3={A}  4={D}

E 的到来不会引起任何变化.假设一个新集合 F={2, 3, 4, 5} 的到来.我们发现

The arrival of E causes no change. Posit the arrival then of a new set F={2, 3, 4, 5}. We find

X = {A} isect {A,D} isect {A} isect {D} isect {}

所以我们不能把 F 扔掉.继续

so we cannot throw F away. Continue with

Y = intersect_{e in {1}} I'(e) = I'(1) = {D}

这告诉我们 DF 的子集,所以应该在添加 F 时丢弃,留下

This tells us D is a subset of F, so should be discarded while F is added, leaving

1={A} 2={A,F} 3={A,F} 4={F} 5={F}

由于算法的在线性质,补码的计算既棘手又漂亮.输入补语的域只需要包括到目前为止看到的输入元素.中间集的全域仅包含当前最大集合中的集标签.对于许多输入流,该集合的大小会随着时间的推移而稳定或减小.

The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.

我希望这会有所帮助.

总结

这里起作用的一般原则是算法设计中经常出现的一个强大的想法.是反向地图.每当您发现自己在进行线性搜索以查找具有给定属性的项目时,请考虑构建从属性返回到项目的映射.构建此地图通常很便宜,并且大大减少了搜索时间.首要的例子是一个排列映射 p[i],它告诉你在排列一个数组后第 i 的元素将占据什么位置.如果您需要搜索在给定位置 a 中结束的项目,则必须在 p 中搜索 a,这是一种线性时间操作.另一方面,逆映射 pi 使得 pi[p[i]] == i 的计算时间比 p(因此它的成本是隐藏的"),但 pi[a] 会在恒定时间内产生所需的结果.

The general principle at work here is a powerful idea that crops of often in algorithm design. It's the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to search out the item that ends up in a given location a, you must search p for a, a linear time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.

由原始海报实施

import collections
import operator
from functools import reduce # only in Python 3

def is_power_of_two(n):
    """Returns True iff n is a power of two.  Assumes n > 0."""
    return (n & (n - 1)) == 0

def eliminate_subsets(sequence_of_sets):
    """Return a list of the elements of `sequence_of_sets`, removing all
    elements that are subsets of other elements.  Assumes that each
    element is a set or frozenset and that no element is repeated."""
    # The code below does not handle the case of a sequence containing
    # only the empty set, so let's just handle all easy cases now.
    if len(sequence_of_sets) <= 1:
        return list(sequence_of_sets)
    # We need an indexable sequence so that we can use a bitmap to
    # represent each set.
    if not isinstance(sequence_of_sets, collections.Sequence):
        sequence_of_sets = list(sequence_of_sets)
    # For each element, construct the list of all sets containing that
    # element.
    sets_containing_element = {}
    for i, s in enumerate(sequence_of_sets):
        for element in s:
            try:
                sets_containing_element[element] |= 1 << i
            except KeyError:
                sets_containing_element[element] = 1 << i
    # For each set, if the intersection of all of the lists in which it is
    # contained has length != 1, this set can be eliminated.
    out = [s for s in sequence_of_sets
           if s and is_power_of_two(reduce(
               operator.and_, (sets_containing_element[x] for x in s)))]
    return out

这篇关于查找所有最大子集的高效算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆