Caching a generator


Problem description

A recent similar question (isinstance(foo, types.GeneratorType) or inspect.isgenerator(foo)?) got me curious about how to implement this generically.

It seems like a generally useful thing to have, actually: a generator-type object that caches on the first pass through (like itertools.cycle), reports StopIteration, and then returns items from the cache on the next pass. But if the object isn't a generator (i.e. a list or dict that inherently supports O(1) lookup), don't cache, and provide the same behaviour over the original container.

Possibilities:

1) Modify itertools.cycle. It looks like this:

def cycle(iterable):
    # cycle('ABCD') --> A B C D A B C D A B C D ...
    saved = []
    try:
        # If .next() exists, it's a generator/iterator: cache as we go.
        saved.append(iterable.next())
        yield saved[-1]
        isiter = True
    except AttributeError:
        # No .next(): a plain container, so no caching is needed.
        saved = iterable
        isiter = False
    for element in iterable:
        yield element
        if isiter:
            saved.append(element)

    # ??? What next?

If I could restart the generator, that would be perfect - I could send back a StopIteration, and then on the next gen.next() return entry 0, i.e. 'A B C D StopIteration A B C D StopIteration', but it doesn't look like that's actually possible.

The second option would be that once StopIteration is hit, saved holds the cache. But there doesn't seem to be any way to get at the internal saved[] list from outside. Maybe a class version of this?
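For what it's worth, a minimal class-based sketch of that idea (my sketch; CachingCycle is a hypothetical name, not from the question) that exposes the cache as an attribute might look like this:

class CachingCycle(object):
    """Hypothetical class version: caches the generator's output on the
    first full pass and exposes it as the .saved attribute."""
    def __init__(self, iterable):
        self.iterable = iterable
        self.saved = []
        self.done = False
    def __iter__(self):
        if self.done:
            # Later passes replay the cache.
            for element in self.saved:
                yield element
            return
        # First pass: consume and cache. Assumes the pass runs to
        # completion; an abandoned pass would leave a partial cache.
        for element in self.iterable:
            self.saved.append(element)
            yield element
        self.done = True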

2) Or I could pass in the list directly:

def cycle(iterable, saved=[]):
    # cycle('ABCD') --> A B C D A B C D A B C D ...
    del saved[:]  # list.clear() is Python 3.3+ only
    try:
        saved.append(iterable.next())
        yield saved[-1]
        isiter = True
    except AttributeError:
        # This rebinds only the local name; the caller's list is untouched.
        saved = iterable
        isiter = False
    for element in iterable:
        yield element
        if isiter:
            saved.append(element)

mysaved = []
myiter = cycle(someiter, mysaved)

But that just looks nasty. In C/C++ I could pass in a reference and repoint it at iterable inside the function - you can't actually do that in Python, since rebinding a parameter only changes the local name. So this doesn't even work.
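A quick standalone illustration of that point (not from the original post):

def rebind(saved, replacement):
    # Rebinding the parameter only changes the local name.
    saved = replacement

def mutate(saved, replacement):
    # In-place mutation, by contrast, is visible to the caller.
    saved[:] = replacement

mylist = []
rebind(mylist, [1, 2, 3])
print(mylist)  # [] - the caller's list is unchanged
mutate(mylist, [1, 2, 3])
print(mylist)  # [1, 2, 3]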

Other options?

Edit: more data. The CachingIterable approach appears to be too slow to be effective, but it did push me in a direction that might work. It's slightly slower than the naive method (converting to a list myself), but appears not to take the hit if the input is already a plain list.
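The CachingIterable implementation isn't shown here; a plausible reconstruction of such a lazily caching wrapper (my sketch, so the original may differ) is:

class CachingIterable(object):
    """Sketch of a lazy caching wrapper: items are appended to a cache
    one at a time as first consumed, so later passes replay the cache."""
    def __init__(self, iterable):
        self._source = iter(iterable)
        self._cache = []
    def __iter__(self):
        # Serve cached items first, then keep pulling from the source.
        for item in self._cache:
            yield item
        for item in self._source:
            self._cache.append(item)
            yield item

The per-item, Python-level bookkeeping on every pass would account for the ~239 us timing below.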

Some code and data:

def cube_generator(max=100):
    i = 0
    while i < max:
        yield i*i*i
        i += 1

# Base case: use generator each time
%%timeit
cg = cube_generator(); [x for x in cg]
cg = cube_generator(); [x for x in cg]
cg = cube_generator(); [x for x in cg]
10000 loops, best of 3: 55.4 us per loop

# Fastest case: flatten to list, then iterate
%%timeit
cg = cube_generator()
cl = list(cg)
[x for x in cl]
[x for x in cl]
[x for x in cl]
10000 loops, best of 3: 27.4 us per loop

%%timeit
cg = cube_generator()
ci2 = CachingIterable(cg)
[x for x in ci2]
[x for x in ci2]
[x for x in ci2]
1000 loops, best of 3: 239 us per loop

# Another attempt, which is closer to the above
# Not exactly the original solution using next(), but close enough I guess
class CacheGen(object):
    """Flatten a generator to a list up front; pass containers through."""
    def __init__(self, iterable):
        if isinstance(iterable, (list, tuple, dict)):
            self._myiter = iterable  # already supports O(1) lookup: no copy
        else:
            self._myiter = list(iterable)  # consume and cache once
    def __iter__(self):
        return iter(self._myiter)
    def __contains__(self, key):
        return key in self._myiter
    def __getitem__(self, key):
        return self._myiter[key]

%%timeit
cg = cube_generator()
ci = CacheGen(cg)
[x for x in ci]
[x for x in ci]
[x for x in ci]
10000 loops, best of 3: 30.5 us per loop

# But if you start with a list, it is faster
cg = cube_generator()
cl = list(cg)
%%timeit
[x for x in cl]
[x for x in cl]
[x for x in cl]
100000 loops, best of 3: 11.6 us per loop

%%timeit
ci = CacheGen(cl)
[x for x in ci]
[x for x in ci]
[x for x in ci]
100000 loops, best of 3: 13.5 us per loop

Any faster recipes that can get closer to the 'pure' loop?

Answer

Based on this comment:

"My intention here is that this would only be used if the user knows he wants to iterate multiple times over the 'iterable', but doesn't know if the input is a generator or iterable. This lets you ignore that distinction, while not losing (much) performance."

This simple solution does exactly that:

def ensure_list(it):
    if isinstance(it, (list, tuple, dict)):
        return it
    else:
        return list(it)

Now ensure_list(a_list) is practically a no-op - two function calls - while ensure_list(a_generator) will turn it into a list and return it, which turned out to be faster than any other approach.
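For example (using cube_generator from above):

cg = cube_generator()
cl = ensure_list(cg)           # generator: flattened to a list once
assert ensure_list(cl) is cl   # already a list: returned unchanged
[x for x in cl]                # later passes are plain list loops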

