测试列表是否共享 python 中的任何项目 [英] Test if lists share any items in python

查看:16
本文介绍了测试列表是否共享 python 中的任何项目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想检查一个列表中的任何项是否存在于另一个列表中.我可以用下面的代码简单地做到这一点,但我怀疑可能有一个库函数来做到这一点.如果没有,是否有更pythonic的方法来实现相同的结果.

在[78]中:a = [1, 2, 3, 4, 5]在 [79] 中:b = [8, 7, 6]在 [80] 中:c = [8, 7, 6, 5]在 [81]: def lists_overlap(a, b):....:对于我在一个:....: 如果我在 b:....:返回真....:返回假....:在 [82] 中:lists_overlap(a, b)出[82]:假在 [83] 中:lists_overlap(a, c)输出[83]:真在 [84] 中:deflists_overlap2(a, b):....: 返回 len(set(a).intersection(set(b))) >0....:

解决方案

Short answer:使用not set(a).isdisjoint(b),一般是最快的.

有四种常用方法可以测试两个列表 ab 是否共享任何项目.第一个选项是将两者都转换为集合并检查它们的交集,如下所示:

bool(set(a) & set(b))

因为集合在 Python 中使用哈希表存储,搜索它们是 O(1)(请参阅

请注意,两个轴都是对数轴.这代表了生成器表达式的最佳情况.可以看出,isdisjoint() 方法更适合非常小的列表大小,而生成器表达式更适合较大的列表大小.

另一方面,由于搜索从混合和生成器表达式的开头开始,如果共享元素系统地位于数组的末尾(或两个列表不共享任何值),则不相交和集合交集然后方法比生成器表达式和混合方法快得多.

<预><代码>>>>timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))13.739536046981812>>>timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", 数量=1000))0.08102107048034668

有趣的是,对于较大的列表大小,生成器表达式的速度要慢得多.这仅适用于 1000 次重复,而不是前一个数字的 100000.当没有元素共享时,这种设置也很近似,并且是不相交和集合相交方法的最佳情况.

这里有两个使用随机数的分析(而不是操纵设置以支持一种或另一种技术):

分享几率高:元素从[1, 2*len(a)]中随机抽取.分享几率低:元素从[1, 1000*len(a)]中随机抽取.

到目前为止,该分析假设两个列表的大小相同.在两个不同大小的列表的情况下,例如 a 小得多,isdisjoint() 总是更快:

确保a列表越小,否则性能下降.在本实验中,a 列表大小设置为常量5.

总结:

  • 如果列表非常小(<10 个元素),not set(a).isdisjoint(b) 总是最快的.
  • 如果列表中的元素已排序或具有您可以利用的规则结构,则生成器表达式 any(i in a for i in b) 在大列表中是最快的;
  • not set(a).isdisjoint(b)测试集合交集,它总是比bool(set(a) & set(b))>.
  • 混合遍历列表,测试集合"a = set(a);any(i in a for i in b) 通常比其他方法慢.
  • 在不共享元素的列表中,生成器表达式和混合比其他两种方法慢得多.

在大多数情况下,使用 isdisjoint() 方法是最好的方法,因为生成器表达式将花费更长的时间来执行,因为在没有共享元素时效率非常低.

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

解决方案

Short answer: use not set(a).isdisjoint(b), it's generally the fastest.

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it supposes that hashing collisions are sparse among your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

a = set(a); any(i in a for i in b)

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

not set(a).isdisjoint(b)

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

Here's a graph of the execution time for this example in function of list size:

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:

Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.

In summary:

  • If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
  • If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
  • Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
  • The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
  • The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

这篇关于测试列表是否共享 python 中的任何项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆