How to test the membership of multiple values in a list?


Question


I want to test whether two or more values are all members of a list, but I'm getting an unexpected result:

>>> 'a','b' in ['b', 'a', 'foo', 'bar']
('a', True)

So, can Python test the membership of multiple values in a list at once? What does that result mean?

Solution

This does what you want, and will work in nearly all cases:

>>> all(x in ['b', 'a', 'foo', 'bar'] for x in ['a', 'b'])
True

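The expression 'a','b' in ['b', 'a', 'foo', 'bar'] doesn't work as expected because Python interprets it as a tuple. The comma binds more loosely than in, so the expression is parsed as ('a', ('b' in ['b', 'a', 'foo', 'bar'])): a two-element tuple whose second element is the result of a single membership test. The same thing happens in simpler cases: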

>>> 'a', 'b'
('a', 'b')
>>> 'a', 5 + 2
('a', 7)
>>> 'a', 'x' in 'xerxes'
('a', True)
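
To make the parsing explicit, here is a small sketch that compares the unparenthesized expression with its fully parenthesized equivalent. The variable names (haystack, unparenthesized, explicit, tuple_as_element) are only illustrative and do not come from the original question.

```python
haystack = ['b', 'a', 'foo', 'bar']

# What the question's expression actually evaluates:
# a tuple of 'a' and the result of one membership test.
unparenthesized = 'a', 'b' in haystack

# The same expression with the implicit grouping written out.
explicit = ('a', ('b' in haystack))

# Parenthesizing the tuple instead asks a different question:
# "is the tuple ('a', 'b') itself an element of haystack?"
tuple_as_element = ('a', 'b') in haystack

print(unparenthesized)   # ('a', True)
print(explicit)          # ('a', True)
print(tuple_as_element)  # False
```

Note that adding parentheses around 'a', 'b' does not help either: it tests whether the tuple is a single element of the list, not whether each value is.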

Other Options

There are other ways to execute this test, but they won't work for as many different kinds of inputs. As Kabie points out, you can solve this problem using sets...

>>> set(['a', 'b']).issubset(set(['a', 'b', 'foo', 'bar']))
True
>>> {'a', 'b'} <= {'a', 'b', 'foo', 'bar'}
True

...sometimes:

>>> {'a', ['b']} <= {'a', ['b'], 'foo', 'bar'}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Sets can only be created with hashable elements. But the generator expression all(x in container for x in items) can handle almost any container type. The only requirement is that container be re-iterable (i.e. not a generator). items can be any iterable at all.

>>> container = [['b'], 'a', 'foo', 'bar']
>>> items = (i for i in ('a', ['b']))
>>> all(x in container for x in items)
True
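
The re-iterability requirement is easy to miss, so here is a minimal sketch of what can go wrong when container is a generator. The names container_list, container_gen, ok_list, and ok_gen are made up for this illustration. Each in test on a generator consumes it up to the first match, so later tests only see whatever is left.

```python
items = ['b', 'a']

# A list can be scanned from the start for every membership test.
container_list = ['a', 'b', 'c']
ok_list = all(x in container_list for x in items)

# A generator cannot: testing 'b' consumes 'a' and 'b',
# so the later test for 'a' only sees 'c' and fails.
container_gen = (c for c in 'abc')
ok_gen = all(x in container_gen for x in items)

print(ok_list)  # True
print(ok_gen)   # False, even though both values were produced
```

If the container arrives as a generator, materializing it first (for example with list(container)) avoids this.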

Speed Tests

In many cases, the subset test will be faster than all, but the difference isn't shocking -- except when the question is irrelevant because sets aren't an option. Converting lists to sets just for the purpose of a test like this won't always be worth the trouble. And converting generators to sets can sometimes be incredibly wasteful, slowing programs down by many orders of magnitude.

Here are a few benchmarks for illustration. The biggest difference comes when both container and items are relatively small. In that case, the subset approach is about an order of magnitude faster:

>>> smallset = set(range(10))
>>> smallsubset = set(range(5))
>>> %timeit smallset >= smallsubset
110 ns ± 0.702 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit all(x in smallset for x in smallsubset)
951 ns ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This looks like a big difference. But as long as container is a set, all is still perfectly usable at vastly larger scales:

>>> bigset = set(range(100000))
>>> bigsubset = set(range(50000))
>>> %timeit bigset >= bigsubset
1.14 ms ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit all(x in bigset for x in bigsubset)
5.96 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using subset testing is still faster, but only by about 5x at this scale. The speed boost is due to Python's fast C-backed implementation of set, but the fundamental algorithm is the same in both cases.
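
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module, as below. The function names subset_test and all_test and the repetition count are arbitrary choices for this sketch, and the absolute numbers will differ from machine to machine.

```python
import timeit

bigset = set(range(100000))
bigsubset = set(range(50000))

def subset_test():
    # Set comparison: the loop runs inside the C implementation of set.
    return bigset >= bigsubset

def all_test():
    # Same membership checks, but driven by a Python-level generator.
    return all(x in bigset for x in bigsubset)

number = 20  # arbitrary repetition count for this sketch
t_subset = timeit.timeit(subset_test, number=number) / number
t_all = timeit.timeit(all_test, number=number) / number

print(f"subset test: {t_subset * 1e3:.3f} ms per call")
print(f"all():       {t_all * 1e3:.3f} ms per call")
```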

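If your items are already stored in a list for other reasons, then you'll have to convert them to a set before using the subset test approach. Then the speedup drops to roughly 2.5x to 3x (compare the 2.1 ms below with the 5.96 ms measured for all above). The definition of bigsubseq is not shown; it is assumed here to be a list with the same contents as bigsubset:

>>> bigsubseq = list(bigsubset)  # assumed definition, not shown in the original
>>> %timeit bigset >= set(bigsubseq)
2.1 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)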

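And if your container is a sequence, and needs to be converted first, then the speedup is even smaller. The definition of bigseq is not shown either; it is assumed here to be a list with the same contents as bigset:

>>> bigseq = list(bigset)  # assumed definition, not shown in the original
>>> %timeit set(bigseq) >= set(bigsubseq)
4.36 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)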

The only time we get disastrously slow results is when we leave container as a sequence:

>>> %timeit all(x in bigseq for x in bigsubseq)
184 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

And of course, we'll only do that if we must. If all the items in bigseq are hashable, then we'll do this instead:

>>> %timeit bigset = set(bigseq); all(x in bigset for x in bigsubseq)
7.24 ms ± 78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's just 1.66x slower than the alternative (set(bigseq) >= set(bigsubseq), timed above at 4.36 ms).

So subset testing is generally faster, but not by an incredible margin. On the other hand, let's look at when all is faster. What if items is a hundred million values long (2000 repetitions of bigsubseq, assuming it holds the same 50,000 values as bigsubset), and is likely to have values that aren't in container?

>>> %timeit hugeiter = (x * 10 for bss in [bigsubseq] * 2000 for x in bss); set(bigset) >= set(hugeiter)
13.1 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit hugeiter = (x * 10 for bss in [bigsubseq] * 2000 for x in bss); all(x in bigset for x in hugeiter)
2.33 ms ± 65.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Converting the generator into a set turns out to be incredibly wasteful in this case. The set constructor has to consume the entire generator. But the short-circuiting behavior of all ensures that only a small portion of the generator needs to be consumed: it stops at the first value that is missing from bigset. As a result it's faster than the subset test by more than three orders of magnitude (2.33 ms versus 13.1 s).
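
To see how little of the generator all actually touches, the sketch below wraps it in a counting generator. The counting helper and the consumed counter are hypothetical names added for this illustration, and bigsubseq is assumed to be list(range(50000)), since its definition is not shown above.

```python
bigset = set(range(100000))
bigsubseq = list(range(50000))  # assumption: same contents as bigsubset

consumed = [0]  # mutable counter updated by the wrapper below

def counting(iterable):
    """Yield items unchanged while counting how many were pulled."""
    for item in iterable:
        consumed[0] += 1
        yield item

# Lazily describes 100,000,000 values; nothing is computed yet.
hugeiter = (x * 10 for bss in [bigsubseq] * 2000 for x in bss)

result = all(x in bigset for x in counting(hugeiter))

# The first value not in bigset is 100000 (x == 10000),
# so all() stops after pulling 10001 items.
print(result, consumed[0])  # False 10001
```

By contrast, set(hugeiter) would have to pull all of the values before any comparison could start.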

This is an extreme example, admittedly. But as it shows, you can't assume that one approach or the other will be faster in all cases.

The Upshot

Most of the time, converting container to a set is worth it, at least if all its elements are hashable. That's because in for sets is O(1) on average, while in for sequences is O(n).

On the other hand, using subset testing is probably only worth it sometimes. Definitely do it if your test items are already stored in a set. Otherwise, all is only a little slower, and doesn't require any additional storage. It can also be used with large generators of items, and sometimes provides a massive speedup in that case.
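
As one possible way to put these recommendations together, here is a sketch of a helper that converts container to a set when it can and otherwise falls back to a linear scan. The function name all_in and its fallback policy are assumptions made for this example, not part of the original answer. It assumes container is re-iterable, as discussed above.

```python
def all_in(container, items):
    """Return True if every value in items is found in container.

    container must be re-iterable; items may be any iterable,
    including a generator, and is consumed lazily.
    """
    if isinstance(container, (set, frozenset)):
        lookup = container
    else:
        try:
            lookup = set(container)   # fast average O(1) membership tests
        except TypeError:
            lookup = container        # unhashable elements: scan instead

    for x in items:
        try:
            found = x in lookup
        except TypeError:
            # x is unhashable but lookup is a set: compare element by element.
            found = any(x == c for c in container)
        if not found:
            return False              # short-circuit, like all()
    return True


print(all_in(['b', 'a', 'foo', 'bar'], ['a', 'b']))        # True
print(all_in([['b'], 'a', 'foo', 'bar'], ['a', ['b']]))    # True
print(all_in(['a', 'b'], (x for x in ['a', 'zzz', 'b'])))  # False
```

Building the set costs O(n) time and extra memory up front, so for a single test against a small container the plain all(x in container for x in items) may be just as good.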
