Random sample from a very long iterable, in Python

Question

I have a long Python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary iterables. Apparently, it needs something that supports the len() operation (and perhaps non-sequential access to the sequence, but that's not clear). And I don't want to build an enormous list just so I can thin it out.

As a matter of fact, it is possible to sample from a sequence uniformly in one pass, without knowing its length: there's a nice algorithm in Programming Perl that does just that (edit: "reservoir sampling", thanks @user2357112!). But does anyone know of a standard Python module that provides this functionality?

Demo of the problem (Python 3):

>>> import itertools, random
>>> random.sample(iter("abcd"), 2)
...
TypeError: Population must be a sequence or set.  For dicts, use list(d).

On Python 2, the error is more transparent:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    random.sample(iter("abcd"), 2)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/random.py", line 321, in sample
    n = len(population)
TypeError: object of type 'iterator' has no len()

If there's no alternative to random.sample(), I'd try my luck with wrapping the generator in an object that provides a __len__ method (I can find out the length in advance). So I'll accept an answer that shows how to do that cleanly.

Recommended answer

Since you know the length of the data returned by your iterable, you can use range() (xrange() on Python 2) to quickly generate indices into your iterable. Then you can just run the iterable until you've grabbed all of the data:

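import random

def sample(it, length, k):
    # Pick k distinct positions. range() supports len() and indexing,
    # so no large list is built (on Python 2, use xrange() instead).
    indices = random.sample(range(length), k)
    # Map each chosen position to its slot in the result for O(1) lookup.
    slot = {index: pos for pos, index in enumerate(indices)}
    result = [None] * k
    for index, datum in enumerate(it):
        if index in slot:
            result[slot[index]] = datum
    return result

print(sample(iter("abcd"), 4, 2))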
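
The phrase "until you've grabbed all of the data" suggests the loop could stop as soon as the last chosen index has been read, rather than exhausting the iterable. A minimal sketch of that variant follows, assuming the length is known in advance as in the question. The name sample_known_length is hypothetical and not part of the original answer.

```python
import random
from itertools import islice

def sample_known_length(it, length, k):
    """Sample k items from an iterable of known length, stopping early.

    Hypothetical helper: same idea as sample() above, but it reads no
    further than the largest chosen index.
    """
    indices = random.sample(range(length), k)
    # Map each chosen position to its slot in the result.
    slot = {index: pos for pos, index in enumerate(indices)}
    result = [None] * k
    # default=-1 makes islice() read nothing when k == 0.
    last = max(indices, default=-1)
    for index, datum in enumerate(islice(it, last + 1)):
        if index in slot:
            result[slot[index]] = datum
    return result

print(sample_known_length(iter("abcd"), 4, 2))
```

Because islice() reads exactly last + 1 items, anything after the largest chosen index stays unconsumed in the underlying iterator.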

Alternatively, here is an implementation of reservoir sampling using "Algorithm R":

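import random

def R(it, k):
    '''https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R'''
    it = iter(it)
    result = []
    for i, datum in enumerate(it):
        if i < k:
            # Fill the reservoir with the first k items.
            result.append(datum)
        else:
            # Item i is the (i+1)-th seen, so draw j uniformly from 0..i
            # inclusive; the item is kept with probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                result[j] = datum
    return result

print(R(iter("abcd"), 2))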

Note that Algorithm R doesn't return the results in random order. In the example given, 'b' will never precede 'a' in the results.
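
If a random order matters, one option is to shuffle the reservoir after the pass. The sketch below is self-contained and uses hypothetical names (reservoir_sample, shuffled_reservoir_sample) that do not appear in the original answer. It draws the replacement index from 0..i inclusive, following the Wikipedia description of Algorithm R.

```python
import random

def reservoir_sample(iterable, k):
    """One-pass uniform sample of k items (Algorithm R); order is not random."""
    reservoir = []
    for i, datum in enumerate(iterable):
        if i < k:
            # The first k items fill the reservoir in arrival order.
            reservoir.append(datum)
        else:
            # Keep item i with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = datum
    return reservoir

def shuffled_reservoir_sample(iterable, k):
    """Same sample, with the k results returned in random order."""
    result = reservoir_sample(iterable, k)
    random.shuffle(result)  # in-place shuffle of only k items
    return result

print(shuffled_reservoir_sample(iter("abcd"), 2))
```

The shuffle touches only the k selected items, so it adds little cost even when the source iterable is very long.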
