Python中的成对设置相交 [英] Pairwise Set Intersection in Python

查看:150
本文介绍了Python中的成对设置相交的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果我有可变数量的集合(我们叫数字 n ),每个集合最多具有 m 个元素,那么计算成对的最有效方法是什么对所有集合的交集?请注意,这不同于所有 n 集的交集.

If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.

例如,如果我有以下几组:

For example, if I have the following sets:

A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}

我希望能够找到:

intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}

另一种可接受的格式(如果使事情变得更容易)将是给定集合中的项目到包含相同项目的集合的映射.例如:

Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:

intersections_C={"a": {"A", "C"},
                 "c": {"A", "B", "C"}
                 "e": {"B", "C"}}

我知道这样做的一种方法是创建一个字典,将所有 n 个集合的并集中的每个值映射到出现它的集合的列表,然后遍历所有这些值创建上面的intersections_C之类的列表,但是我不确定随着 n 的增加和集合的大小变得太大而如何缩放.

I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs and then iterate through all of those values to create lists such as intersections_C above, but I'm not sure how that scales as n increases and the sizes of the set become too large.

一些其他背景信息:

  1. 每个集合的长度大致相同,但也非常大(足够大以至于将它们全部存储在内存中是现实的考虑,尽管没有必要使用避免这种情况的算法)
  2. 与集合本身的大小相比,任何两个集合之间的交点的大小都非常小
  3. 如果有帮助,我们可以假设需要进行有关输入集排序的任何事情.

推荐答案

这应该做您想要的

import random as RND
import string
import itertools as IT

模拟一些数据

fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]

生成S中集合的索引列表,以便可以在下面更简洁地引用这些集合

idx = range(len(S))

获取S中所有可能的项目对.但是,由于集合交集是可交换的,我们希望使用组合而不是置换

get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations

pairs = IT.combinations(idx, 2)

编写函数以执行设置交点

nt = lambda a, b: S[a].intersection(S[b])

将此功能折叠到对上&将每个函数调用的结果键入其参数

res = dict([ (t, nt(*t)) for t in pairs ])

下面的结果是按照OP中引用的第一个选项格式化的,是一个字典,其中 values 是两个序列的设定交集;每个值到一个由这些序列的两个索引组成的元组

the result below, formatted per the first option recited in the OP, is a dictionary in which the values are the set intersections of two sequences; each values keyed to a tuple comprised of the two indices of those sequences

这个解决方案实际上只是两行代码:(i)计算排列; (ii)然后对每个排列应用一些函数,将返回值存储在结构化容器(键值)容器中

this solution, is really just two lines of code: (i) calculate the permutations; (ii) then apply some function over each permutation, storing the returned value in a structured container (key-value) container

此解决方案的内存占用量很小,但是您可以通过在最后一步返回生成器表达式来实现更好的效果,即

the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, ie

res = ( (t, nt(*t)) for t in pairs )

请注意,使用这种方法,对的序列和相应的交集都没有写到内存中,即, res 都是迭代器.

notice that with this approach, neither the sequence of pairs nor the corresponding intersections have been written out in memory--ie, both pairs and res are iterators.

这篇关于Python中的成对设置相交的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆