Fastest (most Pythonic) way to consume an iterator


Problem description

I am curious what the fastest way to consume an iterator would be, and the most Pythonic way.

For example, say that I want to create an iterator with the map builtin that accumulates something as a side-effect. I don't actually care about the result of the map, just the side effect, so I want to blow through the iteration with as little overhead or boilerplate as possible. Something like:

my_set = set()
my_map = map(lambda x, y: my_set.add((x, y)), my_x, my_y)

In this example, I just want to blow through the iterator to accumulate things in my_set, and my_set is just an empty set until I actually run through my_map. Something like:

for _ in my_map:
    pass

or the bare

[_ for _ in my_map]

works, but they both feel clunky. Is there a more Pythonic way to make sure an iterator iterates quickly so that you can benefit from some side-effect?

I tested the two methods above on the following:

import numpy as np

my_x = np.random.randint(100, size=int(1e6))
my_y = np.random.randint(100, size=int(1e6))

with my_set and my_map as defined above. I got the following results with timeit:

for _ in my_map:
    pass
468 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[_ for _ in my_map]
476 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No real difference between the two, and they both feel clunky.

Note, I got similar performance with list(my_map), which was a suggestion in the comments.
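
For reference, that one-liner still pays for storage: the lambda's calls to my_set.add return None, so list(my_map) allocates a million-element list of None values purely to throw it away.

# same side effect, but materializes [None, None, ...] of length 1e6 first
list(my_map)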

Recommended answer

While you shouldn't be creating a map object just for side effects, there is in fact a standard recipe for consuming iterators in the itertools docs:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
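
Usage is straightforward; here is a minimal sketch (the my_map/my_set pair is the one from the question above, assumed freshly created):

it = iter(range(10))
consume(it, 3)        # skip the first three items at C speed
print(next(it))       # prints 3

consume(my_map)       # exhaust the map purely for its side effects
print(len(my_set))    # my_set is now fully populated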

For just the "consume entirely" case, this can be simplified to

def consume(iterator):
    collections.deque(iterator, maxlen=0)

Using collections.deque this way avoids storing all the elements (because maxlen=0) and iterates at C speed, without bytecode interpretation overhead. There's even a dedicated fast path in the deque implementation for using a maxlen=0 deque to consume an iterator.
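
Applied to the question's example, the whole pattern collapses to a few lines (a sketch reusing the my_x/my_y arrays defined above):

import collections

my_set = set()
# exhaust the map at C speed; maxlen=0 means no element is ever stored
collections.deque(map(lambda x, y: my_set.add((x, y)), my_x, my_y), maxlen=0)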

Timing:

In [1]: import collections

In [2]: x = range(1000)

In [3]: %%timeit
   ...: i = iter(x)
   ...: for _ in i:
   ...:     pass
   ...: 
16.5 µs ± 829 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %%timeit
   ...: i = iter(x)
   ...: collections.deque(i, maxlen=0)
   ...: 
12 µs ± 566 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Of course, this is all based on CPython. The entire nature of interpreter overhead is very different on other Python implementations, and the maxlen=0 fast path is specific to CPython. See abarnert's answer for other Python implementations.
