How to get all intersections of sets in Python fast

Question

I would like to compute all (distinct) intersections of a collection of finite sets of integers (here implemented as a list of lists) in Python (to avoid confusion, a formal definition is at the end of the question):

> A = [[0,1,2,3],[0,1,4],[1,2,4],[2,3,4],[0,3,4]]
> all_intersections(A) # desired output
[[], [0], [1], [2], [3], [4], [0, 1], [0, 3], [0, 4], [1, 2], [1, 4], [2, 3], [2, 4], [3, 4], [0, 1, 4], [0, 3, 4], [1, 2, 4], [2, 3, 4], [0, 1, 2, 3]]

I have an algorithm that does it iteratively, but it is rather slow (should I post it?). A test case would be

[[0, 1, 2, 3, 4, 9], [0, 1, 4, 5, 6, 10], [0, 2, 4, 5, 7, 11], [1, 3, 4, 6, 8, 12], [2, 3, 4, 7, 8, 13], [4, 5, 6, 7, 8, 14], [0, 1, 9, 10, 15, 16], [0, 2, 9, 11, 15, 17], [1, 3, 9, 12, 16, 18], [2, 3, 9, 13, 17, 18], [9, 15, 16, 17, 18, 19], [0, 5, 10, 11, 15, 20], [1, 6, 10, 12, 16, 21], [10, 15, 16, 19, 20, 21], [5, 6, 10, 14, 20, 21], [11, 15, 17, 19, 20, 22], [5, 7, 11, 14, 20, 22], [2, 7, 11, 13, 17, 22], [7, 8, 13, 14, 22, 23], [3, 8, 12, 13, 18, 23], [13, 17, 18, 19, 22, 23], [14, 19, 20, 21, 22, 23], [6, 8, 12, 14, 21, 23], [12, 16, 18, 19, 21, 23]]

which takes me about 2.5 seconds to compute.


Any ideas how to do it fast?

Formal definition (hard to write without LaTeX mode): let A = {A1,...,An} be a finite set of finite sets Ai of non-negative integers. The output should then be the set { intersection of the sets in B : B subset of A }.

So the naive algorithm would be to compute the intersection of every subset of A and collect the distinct results. But that clearly takes forever.

Thanks a lot!

Recommended answer

Here is a recursive solution. It is almost instantaneous on your test example:

def allIntersections(frozenSets):
    if len(frozenSets) == 0:
        return []
    else:
        head = frozenSets[0]
        tail = frozenSets[1:]
        tailIntersections = allIntersections(tail)
        newIntersections = [head]
        newIntersections.extend(tailIntersections)
        newIntersections.extend(head & s for s in tailIntersections)
        return list(set(newIntersections))

def all_intersections(lists):
    sets = allIntersections([frozenset(s) for s in lists])
    return [list(s) for s in sets]
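
As a sanity check, the recursive version can be compared against a brute-force enumeration on the small example from the question. The sketch below repeats the recursive function so that it runs on its own; `brute_force_intersections` is a hypothetical reference helper (not part of the original answer) that intersects every non-empty subcollection and is only practical for a handful of sets.

```python
from itertools import combinations

def allIntersections(frozenSets):
    # Recursive version from the answer, repeated so this block is self-contained.
    if len(frozenSets) == 0:
        return []
    head = frozenSets[0]
    tailIntersections = allIntersections(frozenSets[1:])
    newIntersections = [head]
    newIntersections.extend(tailIntersections)
    newIntersections.extend(head & s for s in tailIntersections)
    return list(set(newIntersections))

def brute_force_intersections(lists):
    # Hypothetical reference: intersect every non-empty subcollection (exponential).
    sets = [frozenset(s) for s in lists]
    result = set()
    for k in range(1, len(sets) + 1):
        for combo in combinations(sets, k):
            result.add(frozenset.intersection(*combo))
    return result

A = [[0, 1, 2, 3], [0, 1, 4], [1, 2, 4], [2, 3, 4], [0, 3, 4]]
fast = set(allIntersections([frozenset(s) for s in A]))
slow = brute_force_intersections(A)
print(fast == slow, len(fast))
```

On this example both approaches should agree and yield the 19 sets listed in the question, including the empty set.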

Edit: Here is a cleaner, nonrecursive implementation of the same idea.

The problem is easiest if you define the intersection of an empty collection of sets to be the universal set; an adequate universal set can be obtained by taking the union of all the given sets. This is a standard move in lattice theory, and is dual to taking the union of an empty collection of sets to be the empty set. You can always throw away this universal set if you don't want it:

def allIntersections(frozenSets):
    universalSet = frozenset().union(*frozenSets)  # also works for an empty collection
    intersections = set([universalSet])
    for s in frozenSets:
        moreIntersections = set(s & t for t in intersections)
        intersections.update(moreIntersections)
    return intersections

def all_intersections(lists):
    sets = allIntersections([frozenset(s) for s in lists])
    return [list(s) for s in sets]
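
If the universal set is not wanted, one way to drop it is sketched below. The wrapper name `all_intersections_without_universe` is hypothetical, and the sketch assumes the universal set should be kept when it happens to be one of the input sets, since it is then a genuine intersection. The result is sorted by size and then lexicographically, purely to match the layout of the desired output in the question.

```python
def allIntersections(frozenSets):
    # Iterative version from above; frozenset().union(...) is used here so that
    # an empty input does not raise.
    universalSet = frozenset().union(*frozenSets)
    intersections = {universalSet}
    for s in frozenSets:
        intersections.update({s & t for t in intersections})
    return intersections

def all_intersections_without_universe(lists):
    # Hypothetical wrapper: discard the universal set unless it is an input set.
    sets = [frozenset(s) for s in lists]
    result = allIntersections(sets)
    universe = frozenset().union(*sets)
    if universe not in sets:
        result.discard(universe)
    # Sort each set, then order the sets by (size, contents) for readable output.
    return sorted((sorted(s) for s in result), key=lambda s: (len(s), s))

A = [[0, 1, 2, 3], [0, 1, 4], [1, 2, 4], [2, 3, 4], [0, 3, 4]]
output = all_intersections_without_universe(A)
print(output)
```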

The reason this is so fast on your test example is that, even though your collection has 24 sets and hence 2**24 (about 16.8 million) potential intersections, there are in fact only 242 (or 241 if you don't count the empty intersection) distinct intersections. Thus the number of intersections held in each pass through the loop is in the low hundreds at most.
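
To see how slowly the working set grows, one can record its size after each pass of the loop. The following is a minimal sketch on the small example from the question; `intersection_counts` is a hypothetical helper that mirrors the iterative loop, and the same helper could be run on the 24-set test case.

```python
def intersection_counts(lists):
    # Hypothetical helper: same loop as the iterative version, but it records
    # how many distinct intersections exist after each input set is processed.
    sets = [frozenset(s) for s in lists]
    intersections = {frozenset().union(*sets)}  # start from the universal set
    counts = []
    for s in sets:
        intersections.update({s & t for t in intersections})
        counts.append(len(intersections))
    return counts

A = [[0, 1, 2, 3], [0, 1, 4], [1, 2, 4], [2, 3, 4], [0, 3, 4]]
counts = intersection_counts(A)
print(counts)
```

For this example the final count is 20: the 19 intersections from the question plus the universal set, far fewer than one would fear from an exponential number of subcollections in larger cases.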

It is possible to pick 24 sets so that all of the 2**24 possible intersections are in fact different, so it is easy to see that the worst-case behavior is exponential. But if, as in your test example, the number of distinct intersections is small, this approach will allow you to compute them rapidly.
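
One family that exhibits this worst case (an illustration chosen here, not taken from the original answer) consists of the n sets obtained by removing a single element from {0, ..., n-1}. Intersecting any subcollection removes exactly the corresponding elements, so every subcollection gives a different result. The sketch uses n = 10 to keep the run short.

```python
def allIntersections(frozenSets):
    # Iterative version from above, repeated so this block is self-contained.
    universalSet = frozenset().union(*frozenSets)
    intersections = {universalSet}
    for s in frozenSets:
        intersections.update({s & t for t in intersections})
    return intersections

n = 10
universe = frozenset(range(n))
# Set i is the universe with element i removed (illustrative worst-case family).
worst_case = [universe - {i} for i in range(n)]
result = allIntersections(worst_case)
print(len(result), 2 ** n)
```

Here the result contains every subset of the universe, so its size doubles with each additional input set.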

A potential optimization might be to sort the sets by increasing size before you loop over them. Processing the smaller sets up front might make empty intersections appear earlier, thus keeping the total number of distinct intersections smaller until towards the end of the loop.
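
A minimal sketch of that ordering idea follows. The function name `allIntersectionsSorted` is hypothetical, and whether sorting actually helps on a given input has not been measured here. The final result is unaffected by the processing order, since the set of all intersections does not depend on the order in which the input sets are visited.

```python
def allIntersectionsSorted(frozenSets):
    # Hypothetical variant: process the input sets from smallest to largest.
    universalSet = frozenset().union(*frozenSets)
    intersections = {universalSet}
    for s in sorted(frozenSets, key=len):
        intersections.update({s & t for t in intersections})
    return intersections

def allIntersectionsUnsorted(frozenSets):
    # Same loop in the original input order, kept here only for comparison.
    universalSet = frozenset().union(*frozenSets)
    intersections = {universalSet}
    for s in frozenSets:
        intersections.update({s & t for t in intersections})
    return intersections

A = [frozenset(s) for s in [[0, 1, 2, 3], [0, 1, 4], [1, 2, 4], [2, 3, 4], [0, 3, 4]]]
same = allIntersectionsSorted(A) == allIntersectionsUnsorted(A)
print(same)
```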
