Python: Fast extraction of intersections among all possible 2-combinations in a large number of lists

Problem description

I have a dataset of about 9K lists of variable length (1 to 100K elements). I need to calculate the length of the intersection of every possible pair of lists in this dataset. Note that the elements in each list are unique, so they can be stored as sets in Python.

What is the most efficient way to do this in Python?

Edit: I forgot to specify that I need to be able to match the intersection values to the corresponding pair of lists. Thanks, everybody, for the prompt responses, and apologies for the confusion!

Recommended answer

If your sets are stored in s, for example:

s = [set([1, 2]), set([1, 3]), set([1, 2, 3]), set([2, 4])]

Then you can use itertools.combinations to take them two by two and calculate each intersection (note that, as Alex pointed out, combinations is only available since Python 2.6). Here it is with a list comprehension (just for the sake of the example):

from itertools import combinations
[i[0] & i[1] for i in combinations(s, 2)]

Or, in a loop, which is probably what you need:

for i in combinations(s, 2):
    inter = i[0] & i[1]
    # process the intersection set "inter" here

So, to get the length of each intersection, that "processing" would be:

    l = len(inter)

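This is quite memory-efficient, since combinations returns an iterator that produces each pair on demand and does not prepare all of them in advance. The number of pairs is still n(n-1)/2, roughly 40.5 million for 9K lists, so the intersections themselves dominate the running time.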
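As a minimal sketch of the loop described above, the pieces can be put together on the small example data. The function name pair_lengths is made up for illustration, and the set literals assume Python 2.7 or later:

```python
from itertools import combinations

# Example data from the answer, written with set literals.
s = [{1, 2}, {1, 3}, {1, 2, 3}, {2, 4}]

def pair_lengths(sets):
    """Yield len(a & b) for every unordered pair of sets, one pair at a time."""
    for a, b in combinations(sets, 2):
        yield len(a & b)

lengths = list(pair_lengths(s))
print(lengths)  # [1, 2, 1, 2, 0, 1]
```

The lengths come out in the order combinations produces the pairs: (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3).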

Edit: Note that with this method, the list "s" can be any iterable that yields sets, such as a generator. It could be much slower, depending on how you generate these elements. Be aware, though, that combinations copies its input into an internal tuple before producing pairs, so all the sets still end up in memory at the same time; a generator only spares you from building the list yourself (not that memory should be a problem in your case).

For example, if the sets are produced by a generator function gen:

from itertools import combinations

def gen(parameter):
    # more_sets() is a placeholder for whatever tells you more data is available
    while more_sets():
        # ... some code to generate the next set 'x'
        yield x

with open("results", "wt") as f_results:
    for i in combinations(gen("data"), 2):
        inter = i[0] & i[1]
        f_results.write("%d\n" % len(inter))
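A runnable variant of the same idea might look like the sketch below. The input format (one set per line, as whitespace-separated integers) and the names gen_sets and write_lengths are assumptions made for illustration, and an in-memory buffer stands in for the results file. Note that combinations reads the whole generator before it yields the first pair.

```python
import io
from itertools import combinations

def gen_sets(lines):
    """Yield one set per line of whitespace-separated integers (assumed format)."""
    for line in lines:
        yield set(int(tok) for tok in line.split())

def write_lengths(lines, out):
    """Write the intersection size of every pair of sets, one per line."""
    for a, b in combinations(gen_sets(lines), 2):
        out.write("%d\n" % len(a & b))

data = ["1 2", "1 3", "1 2 3", "2 4"]  # stand-in for lines read from a file
buf = io.StringIO()                    # stand-in for the "results" file
write_lengths(data, buf)
result_text = buf.getvalue()
```

In real use, `data` could be an open file object and `buf` a file opened for writing.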


Edit 2: How to collect indices (following redrat's comment).

Besides the quick solution I gave in the comments, a more efficient way to collect the set indices is to use a list of (index, set) tuples instead of a list of sets.

Example with the new format:

s = [(0, set([1, 2])), (1, set([1, 3])), (2, set([1, 2, 3]))]

If you are building this list to calculate the combinations anyway, it should be simple to adapt to your new requirements. The main loop becomes:

with open("results", "wt") as f_results:
    for i in combinations(s, 2):
        inter = i[0][1] & i[1][1]
        f_results.write("length of %d & %d: %d\n" % (i[0][0], i[1][0], len(inter)))

In the loop, i[0] and i[1] are each a tuple (index, set), so i[0][1] is the first set and i[0][0] is its index.
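If the sets already live in a plain list, one way to get the (index, set) pairs without building them by hand is enumerate. The following is only a sketch: the function name intersection_sizes and the choice of a dict keyed by index pairs are illustrative assumptions, not part of the original answer.

```python
from itertools import combinations

sets = [{1, 2}, {1, 3}, {1, 2, 3}]

def intersection_sizes(sets):
    """Map each index pair (i, j) with i < j to len(sets[i] & sets[j])."""
    sizes = {}
    # enumerate yields (index, set) tuples, matching the format described above.
    for (i, a), (j, b) in combinations(enumerate(sets), 2):
        sizes[(i, j)] = len(a & b)
    return sizes

sizes = intersection_sizes(sets)
print(sizes[(0, 2)])  # 2
```

For 9K lists such a dict would hold about 40.5 million entries, so writing each result to a file as in the loop above may be the more practical choice.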
