Python中的成对设置相交 [英] Pairwise Set Intersection in Python
问题描述
如果我有可变数量的集合(我们叫数字 n ),每个集合最多具有 m 个元素,那么计算成对的最有效方法是什么对所有集合的交集?请注意,这不同于所有 n 集的交集.
If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.
例如,如果我有以下几组:
For example, if I have the following sets:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
我希望能够找到:
intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}
另一种可接受的格式(如果使事情变得更容易)将是给定集合中的项目到包含相同项目的集合的映射.例如:
Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:
intersections_C={"a": {"A", "C"},
"c": {"A", "B", "C"}
"e": {"B", "C"}}
我知道这样做的一种方法是创建一个字典,将所有 n 个集合的并集中的每个值映射到出现它的集合的列表,然后遍历所有这些值创建上面的intersections_C
之类的列表,但是我不确定随着 n 的增加和集合的大小变得太大而如何缩放.
I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs and then iterate through all of those values to create lists such as intersections_C
above, but I'm not sure how that scales as n increases and the sizes of the set become too large.
一些其他背景信息:
- 每个集合的长度大致相同,但也非常大(足够大以至于将它们全部存储在内存中是现实的考虑,尽管没有必要使用避免这种情况的算法)
- 与集合本身的大小相比,任何两个集合之间的交点的大小都非常小
- 如果有帮助,我们可以假设需要进行有关输入集排序的任何事情.
推荐答案
这应该做您想要的
import random as RND
import string
import itertools as IT
模拟一些数据
fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]
生成S中集合的索引列表,以便可以在下面更简洁地引用这些集合
idx = range(len(S))
获取S中所有可能的项目对.但是,由于集合交集是可交换的,我们希望使用组合而不是置换
get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations
pairs = IT.combinations(idx, 2)
编写函数以执行设置交点
nt = lambda a, b: S[a].intersection(S[b])
将此功能折叠到对上&将每个函数调用的结果键入其参数
res = dict([ (t, nt(*t)) for t in pairs ])
下面的结果是按照OP中引用的第一个选项格式化的,是一个字典,其中 values 是两个序列的设定交集;每个值键到一个由这些序列的两个索引组成的元组
the result below, formatted per the first option recited in the OP, is a dictionary in which the values are the set intersections of two sequences; each values keyed to a tuple comprised of the two indices of those sequences
这个解决方案实际上只是两行代码:(i)计算排列; (ii)然后对每个排列应用一些函数,将返回值存储在结构化容器(键值)容器中
this solution, is really just two lines of code: (i) calculate the permutations; (ii) then apply some function over each permutation, storing the returned value in a structured container (key-value) container
此解决方案的内存占用量很小,但是您可以通过在最后一步返回生成器表达式来实现更好的效果,即
the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, ie
res = ( (t, nt(*t)) for t in pairs )
请注意,使用这种方法,对的序列和相应的交集都没有写到内存中,即,对和 res 都是迭代器.
notice that with this approach, neither the sequence of pairs nor the corresponding intersections have been written out in memory--ie, both pairs and res are iterators.
这篇关于Python中的成对设置相交的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!