带有生成器/可迭代/迭代器的Python随机样本 [英] Python random sample with a generator / iterable / iterator

查看:101
本文介绍了带有生成器/可迭代/迭代器的Python随机样本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

您知道是否有一种方法可以让python的random.sample与生成器对象一起工作.我正在尝试从一个非常大的文本语料库中获取一个随机样本.问题是random.sample()引发以下错误.

Do you know if there is a way to get python's random.sample to work with a generator object. I am trying to get a random sample from a very large text corpus. The problem is that random.sample() raises the following error.

TypeError: object of type 'generator' has no len()

我当时在想,也许可以用itertools中的某种方法来做到这一点,但是经过一点搜索却找不到任何东西.

I was thinking that maybe there is some way of doing this with something from itertools but couldn't find anything with a bit of searching.

一个有些虚构的例子:

import random
def list_item(ls):
    for item in ls:
        yield item

random.sample( list_item(range(100)), 20 )



更新



UPDATE

根据MartinPieters的请求,我对当前提出的三种方法进行了一些计时.结果如下.

As per MartinPieters's request I did some timing of the currently proposed three methods. The results are as follows.

Sampling 1000 from 10000
Using iterSample 0.0163 s
Using sample_from_iterable 0.0098 s
Using iter_sample_fast 0.0148 s

Sampling 10000 from 100000
Using iterSample 0.1786 s
Using sample_from_iterable 0.1320 s
Using iter_sample_fast 0.1576 s

Sampling 100000 from 1000000
Using iterSample 3.2740 s
Using sample_from_iterable 1.9860 s
Using iter_sample_fast 1.4586 s

Sampling 200000 from 1000000
Using iterSample 7.6115 s
Using sample_from_iterable 3.0663 s
Using iter_sample_fast 1.4101 s

Sampling 500000 from 1000000
Using iterSample 39.2595 s
Using sample_from_iterable 4.9994 s
Using iter_sample_fast 1.2178 s

Sampling 2000000 from 5000000
Using iterSample 798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast 6.6482 s

因此,对于大样本量而言,array.insert有一个严重的缺陷.我用来计时方法的代码

So it turns out that the array.insert has a serious drawback when it comes to large sample sizes. The code I used to time the methods

from heapq import nlargest
import random
import timeit


def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v) # add first samplesize items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")

    return results

def sample_from_iterable(iterable, samplesize):
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    for _ in xrange(samplesize):
        results.append(iterator.next())
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results

if __name__ == '__main__':
    pop_sizes = [int(10e+3),int(10e+4),int(10e+5),int(10e+5),int(10e+5),int(10e+5)*5]
    k_sizes = [int(10e+2),int(10e+3),int(10e+4),int(10e+4)*2,int(10e+4)*5,int(10e+5)*2]

    for pop_size, k_size in zip(pop_sizes, k_sizes):
        pop = xrange(pop_size)
        k = k_size
        t1 = timeit.Timer(stmt='iterSample(pop, %i)'%(k_size), setup='from __main__ import iterSample,pop')
        t2 = timeit.Timer(stmt='sample_from_iterable(pop, %i)'%(k_size), setup='from __main__ import sample_from_iterable,pop')
        t3 = timeit.Timer(stmt='iter_sample_fast(pop, %i)'%(k_size), setup='from __main__ import iter_sample_fast,pop')

        print 'Sampling', k, 'from', pop_size
        print 'Using iterSample', '%1.4f s'%(t1.timeit(number=100) / 100.0)
        print 'Using sample_from_iterable', '%1.4f s'%(t2.timeit(number=100) / 100.0)
        print 'Using iter_sample_fast', '%1.4f s'%(t3.timeit(number=100) / 100.0)
        print ''

我还进行了一项测试,以检查所有方法是否确实都对发生器进行了无偏向采样.因此,对于所有方法,我从10000 100000次中采样了1000个元素,并计算出总体中每个项目出现的平均频率,结果发现这是~.1,这是所有三种方法所期望的.

I also ran a test to check that all the methods indeed do take an unbiased sample of the generator. So for all methods I sampled 1000 elements from 10000 100000 times and computed the average frequency of occurrence of each item in the population which turns out to be ~.1 as one would expect for all three methods.

推荐答案

尽管Martijn Pieters的答案是正确的,但当samplesize变大时,它的确会放慢速度,因为在循环中使用list.insert可能具有二次复杂度.

While the answer of Martijn Pieters is correct, it does slow down when samplesize becomes large, because using list.insert in a loop may have quadratic complexity.

在我看来,这是在提高性能的同时保持均匀性的另一种选择:

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in xrange(samplesize):
            results.append(iterator.next())
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results

对于10000以上的samplesize值,差异逐渐开始显示.用(1000000, 100000)拨打电话的时间:

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):

  • iterSample:5.05秒
  • iter_sample_fast:2.64秒

这篇关于带有生成器/可迭代/迭代器的Python随机样本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆