测试列表是否共享python中的任何项目 [英] Test if lists share any items in python

查看:77
本文介绍了测试列表是否共享python中的任何项目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想检查一个列表中的任何个项目是否存在于另一列表中.我可以使用下面的代码简单地做到这一点,但是我怀疑可能有一个库函数可以做到这一点.如果没有,是否有更多的pythonic方法可以达到相同的结果.

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

推荐答案

简短答案:使用not set(a).isdisjoint(b),通常最快.

有四种常见方法可以测试两个列表ab是否共享任何项目.第一种选择是将两个都转换为集合并检查它们的交集,如下所示:

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

因为集合是使用Python中的哈希表存储的,所以搜索它们是O(1) (请参见

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it supposes that hashing collisions are sparse among your data.

第二种方法是使用生成器表达式对列表执行迭代,例如:

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

这允许就地搜索,因此不会为中间变量分配新的内存.它也可以在第一个发现上解决. in运算符始终在列表中O(n) (请参见此处).

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).

另一个建议的选项是通过迭代来遍历列表中的一个,将另一个转换成一个集合,并测试该集合的成员资格,就像这样:

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

a = set(a); any(i in a for i in b)

第四种方法是利用(冻结)集合的isdisjoint()方法(请参阅此处),例如:

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

not set(a).isdisjoint(b)

如果搜索的元素靠近数组的开头(例如已排序),则倾向于使用生成器表达式,因为集合交集方法必须为中间变量分配新的内存:

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

以下是此示例的执行时间图,该列表以列表大小为函数:

Here's a graph of the execution time for this example in function of list size:

请注意,两个轴都是对数的.这代表了生成器表达式的最佳情况.可以看出,isdisjoint()方法对于较小的列表大小更好,而生成器表达式对于较大的列表大小更好.

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

另一方面,由于搜索是从混合表达式和生成器表达式的开头开始的,因此如果共享元素系统地位于数组的末尾(或者两个列表都不共享任何值),则不相交和集合相交然后,这些方法比生成器表达式和混合方法要快得多.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

有趣的是,对于较大的列表大小,生成器表达式的速度要慢得多.这仅适用于1000次重复,而不是前一个数字的100000次.当没有共享任何元素时,这种设置也很合适,这是不相交和设置相交方法的最佳情况.

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

这里有两个使用随机数的分析(而不是操纵设置以支持一种或另一种技术):

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

分享的可能性很高:元素是从[1, 2*len(a)]中随机抽取的.分享机会低:元素是从[1, 1000*len(a)]中随机抽取的.

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

到目前为止,该分析假设两个列表的大小相同.如果有两个大小不同的列表,例如a小得多,isdisjoint()总是更快:

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:

请确保a列表较小,否则性能会降低.在此实验中,将a列表大小设置为常量5.

Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.

总结:

  • 如果列表很小(<10个元素),则not set(a).isdisjoint(b)始终是最快的.
  • 如果列表中的元素已排序或具有可利用的常规结构,则生成器表达式any(i in a for i in b)在大列表大小上最快;
  • 使用not set(a).isdisjoint(b)测试设置的交集,该交集总是比bool(set(a) & set(b))快.
  • 混合遍历列表,在设置中进行测试" a = set(a); any(i in a for i in b)通常比其他方法慢.
  • 在不共享元素的列表中,生成器表达式和混合函数比其他两种方法要慢得多.
  • If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
  • If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
  • Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
  • The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
  • The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

在大多数情况下,使用isdisjoint()方法是最好的方法,因为生成器表达式的执行时间会更长,因为在没有共享任何元素的情况下效率非常低.

In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

这篇关于测试列表是否共享python中的任何项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆