Fastest way to find the intersection of n lists of sets

Problem description

I have 3 variable length lists of sets. I need to find the common
elements in each list (across sets) really really quickly.

Here is some sample code:

import operator
from functools import reduce  # reduce is a builtin on Python 2

# Doesn't make sense to union the sets - we're going to do
# intersections later anyway
l1 = reduce(operator.add, (list(x) for x in l1))
l2 = reduce(operator.add, (list(x) for x in l2))
l3 = reduce(operator.add, (list(x) for x in l3))

# Should I do this in two steps? Maybe by intersecting the two
# shortest lists first?
s = frozenset(l1) & frozenset(l2) & frozenset(l3)

I'm assuming frozensets are (somehow) quicker than sets because
they're immutable.

Any code suggestions? Maybe using something in the new fancy-schmancy
itertools module?

Thanks,
Prateek

Solution

I don't understand why you cast to list. I would propose:

lists_of_sets = [l1, l2, l3]

reduce(set.intersection, (reduce(set.union, x) for x in lists_of_sets))

Since this is simplest, I'm guessing it should be fastest, because I'm
also guessing that the set implementation is as optimized as list. I think
this would hold true especially for large sets with sparse overlap
between lists. So it might be reasonable to consider the content of your
sets and lists and write your code based on the content or on assuming a
particular composition.

James
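The union-then-intersect approach proposed above could be sketched on current Python roughly as follows. This is only a minimal sketch: the function name `common_elements` and the sample data are made up for illustration, and `set().union(*sets)` is swapped in for the inner `reduce` so that an empty list of sets yields an empty set instead of raising an error.

```python
from functools import reduce


def common_elements(lists_of_sets):
    """Union the sets inside each list, then intersect the unions."""
    # set().union(*sets) also copes with an empty list and with frozensets.
    unions = (set().union(*sets) for sets in lists_of_sets)
    return reduce(set.intersection, unions)


# Hypothetical sample data: three lists of sets.
l1 = [{1, 2}, {3}]
l2 = [{2, 3}, {4}]
l3 = [{3, 5}, {2}]
result = common_elements([l1, l2, l3])
print(result)
```

Note that `reduce` without an initial value still raises `TypeError` if `lists_of_sets` itself is empty, so a caller would need to decide what an intersection over zero lists should mean.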


Python sets are hashes, like dictionaries, not trees. Intersection
is implemented by iterating over the smaller set and trying all its keys
on the other set. The Python implementation compares the lengths of the two
sets being intersected to decide which one to iterate over. This is OK
(it's about O(N) in the size of the smaller set on average, since each
hash lookup takes expected constant time).
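As a rough illustration of the idea just described, here is a pure-Python sketch. It is not CPython's actual C code, and the name `intersect_smaller_first` is invented for this example; in practice `a & b` already does this internally.

```python
def intersect_smaller_first(a, b):
    """Iterate over the smaller set and probe the larger one by hash lookup."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    # One membership test per element of the smaller set.
    return {x for x in small if x in large}


tiny = {2, 3, 5000}
huge = set(range(1000))
print(intersect_smaller_first(huge, tiny))
```

The number of lookups depends only on the smaller set, which is why a 3-element set intersected with a 100,000-element set is cheap.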

For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.
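One way this ordering might look in code is sketched below. The function name `intersect_shortest_first` is an assumption for illustration, and the early exit on an empty running result is an addition that the thread does not mention.

```python
def intersect_shortest_first(sets):
    """Intersect sets in order of increasing size, stopping once empty."""
    ordered = sorted(sets, key=len)
    if not ordered:
        return set()
    result = set(ordered[0])  # copy, so the caller's set is not modified
    for s in ordered[1:]:
        if not result:
            break  # nothing left to intersect
        result &= s
    return result


sets = [set(range(100)), {5, 7, 200}, set(range(50))]
print(intersect_shortest_first(sets))
```

Starting from the smallest set keeps every intermediate result no larger than that set, so the later intersections with big sets stay cheap.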

How big are the sets? If they're small, but you have a lot of
them, you may be better off with a bit-set representation, then
using AND operations for intersection. If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.
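Both alternatives can be sketched briefly. Everything below is illustrative and rests on assumptions: the bit-set version assumes a fixed, known universe of elements mapped to bit positions and uses plain Python integers as bit sets, and the merge version assumes both inputs are already sorted and free of duplicates. The helper names are hypothetical.

```python
def to_bitset(elements, index):
    """Encode elements as an int with one bit per element of the universe."""
    bits = 0
    for e in elements:
        bits |= 1 << index[e]
    return bits


def from_bitset(bits, universe):
    """Decode an int bit set back into a set of elements."""
    return {universe[i] for i in range(bits.bit_length()) if (bits >> i) & 1}


def merge_intersection(xs, ys):
    """Intersect two sorted, duplicate-free sequences in one linear pass."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


universe = ["a", "b", "c", "d"]
index = {e: i for i, e in enumerate(universe)}
a = to_bitset({"a", "c", "d"}, index)
b = to_bitset({"b", "c", "d"}, index)
print(from_bitset(a & b, universe))           # intersection via bitwise AND
print(merge_intersection([1, 3, 5, 7], [3, 4, 5, 8]))
```

Whether either beats the built-in set intersection depends on the data and would need to be measured.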

When you ask questions like this, it's helpful to give some
background. We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.

John Nagle


John Nagle wrote:
> For the above example, it's worth sorting lists_of_sets by the
> length of the sets, and doing the short ones first.

Thanks. I thought so - I'm doing just that using a simple
Decorate-Sort-Undecorate idiom.
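The actual code is not shown in the thread, so the following is only a guess at what a Decorate-Sort-Undecorate version might look like; `sort_sets_dsu` is a hypothetical name. On current Python, `sorted(sets, key=len)` gives the same ordering more directly.

```python
def sort_sets_dsu(sets):
    """Sort sets by size using Decorate-Sort-Undecorate."""
    # Decorate: pair each set with its length. The position i breaks ties,
    # so two sets of equal size are never compared with each other.
    decorated = [(len(s), i, s) for i, s in enumerate(sets)]
    decorated.sort()
    # Undecorate: drop the sort keys again.
    return [s for _, _, s in decorated]


sets = [{1, 2, 3}, {4}, {5, 6}]
print(sort_sets_dsu(sets))
```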

> How big are the sets? If they're small, but you have a lot of
> them, you may be better off with a bit-set representation, then
> using AND operations for intersection. If they're huge (tens of millions
> of entries), you might be better off doing sorts and merges on the
> sets.

I have either 2 or 3 sets (never more) which can be arbitrarily large.
Most of them are small (between 0 and a few elements - say fewer than 5).
A few will be megamonstrous (100,000 items).

> When you ask questions like this, it's helpful to give some
> background. We don't know whether this is a homework assignment, or
> some massive application that's slow and you need to fix it, even
> if it requires heavy implementation effort.

It's definitely not a homework assignment - it's part of a commercial
database query engine. Heavy implementation effort is no problem.

Prateek

