Python random sample with a generator / iterable / iterator
Question
Do you know if there is a way to get Python's random.sample to work with a generator object? I am trying to get a random sample from a very large text corpus. The problem is that random.sample() raises the following error:
TypeError: object of type 'generator' has no len()
I was thinking that maybe there is some way of doing this with something from itertools, but couldn't find anything with a bit of searching.
A somewhat contrived example:
import random

def list_item(ls):
    for item in ls:
        yield item

random.sample(list_item(range(100)), 20)
UPDATE
As per Martijn Pieters's request I did some timing of the currently proposed three methods. The results are as follows.
Sampling 1000 from 10000
Using iterSample 0.0163 s
Using sample_from_iterable 0.0098 s
Using iter_sample_fast 0.0148 s
Sampling 10000 from 100000
Using iterSample 0.1786 s
Using sample_from_iterable 0.1320 s
Using iter_sample_fast 0.1576 s
Sampling 100000 from 1000000
Using iterSample 3.2740 s
Using sample_from_iterable 1.9860 s
Using iter_sample_fast 1.4586 s
Sampling 200000 from 1000000
Using iterSample 7.6115 s
Using sample_from_iterable 3.0663 s
Using iter_sample_fast 1.4101 s
Sampling 500000 from 1000000
Using iterSample 39.2595 s
Using sample_from_iterable 4.9994 s
Using iter_sample_fast 1.2178 s
Sampling 2000000 from 5000000
Using iterSample 798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast 6.6482 s
So it turns out that list.insert has a serious drawback when it comes to large sample sizes. The code I used to time the methods:
from heapq import nlargest
import random
import timeit

def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v)  # add first samplesize items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results

def sample_from_iterable(iterable, samplesize):
    # Tag every item with a uniform random key and keep the samplesize largest;
    # every subset of size samplesize is then equally likely.
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    for _ in xrange(samplesize):
        results.append(iterator.next())
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results

if __name__ == '__main__':
    pop_sizes = [int(10e+3), int(10e+4), int(10e+5), int(10e+5), int(10e+5), int(10e+5) * 5]
    k_sizes = [int(10e+2), int(10e+3), int(10e+4), int(10e+4) * 2, int(10e+4) * 5, int(10e+5) * 2]

    for pop_size, k_size in zip(pop_sizes, k_sizes):
        pop = xrange(pop_size)
        k = k_size
        t1 = timeit.Timer(stmt='iterSample(pop, %i)' % k_size, setup='from __main__ import iterSample, pop')
        t2 = timeit.Timer(stmt='sample_from_iterable(pop, %i)' % k_size, setup='from __main__ import sample_from_iterable, pop')
        t3 = timeit.Timer(stmt='iter_sample_fast(pop, %i)' % k_size, setup='from __main__ import iter_sample_fast, pop')

        print 'Sampling', k, 'from', pop_size
        print 'Using iterSample', '%1.4f s' % (t1.timeit(number=100) / 100.0)
        print 'Using sample_from_iterable', '%1.4f s' % (t2.timeit(number=100) / 100.0)
        print 'Using iter_sample_fast', '%1.4f s' % (t3.timeit(number=100) / 100.0)
        print ''
I also ran a test to check that all the methods indeed take an unbiased sample of the generator. So for all methods I sampled 1000 elements from 10000, 100000 times, and computed the average frequency of occurrence of each item in the population, which turns out to be ~.1 as one would expect for all three methods.
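A minimal sketch of such a uniformity check (an illustration in the same Python 2 style as the code above, not the author's exact harness; check_uniformity and its smaller default trial count are made up for the example):

from collections import defaultdict

def check_uniformity(sampler, pop_size=10000, samplesize=1000, trials=1000):
    # Count how often each population item is selected across many trials.
    counts = defaultdict(int)
    for _ in xrange(trials):
        for x in sampler(xrange(pop_size), samplesize):
            counts[x] += 1
    # Each item should appear with frequency samplesize / pop_size (0.1 here).
    freqs = [counts[x] / float(trials) for x in xrange(pop_size)]
    print 'mean %.4f, min %.4f, max %.4f' % (
        sum(freqs) / len(freqs), min(freqs), max(freqs))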
Recommended Answer
While the answer of Martijn Pieters is correct, it does slow down when samplesize becomes large, because using list.insert in a loop has quadratic complexity: each insert into a list of length n costs O(n), so inserting samplesize items this way costs O(samplesize**2) overall.
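A quick way to see the quadratic behaviour (an illustrative micro-benchmark, not part of the original answer; absolute times depend on the machine, but doubling the number of inserts should roughly quadruple the time):

import timeit

stmt = '''
xs = []
for i in xrange(k):
    xs.insert(random.randint(0, i), i)
'''
for k in (10000, 20000, 40000):
    print '%6d inserts: %.3f s' % (
        k, timeit.timeit(stmt, setup='import random; k = %d' % k, number=1))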
Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:
import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in xrange(samplesize):
            results.append(iterator.next())
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results
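For example, called with a generator over a large corpus file (a brief usage sketch, not from the original answer; the file name and tokenization are hypothetical):

def words(path):
    # Stream tokens from a large file without loading it all into memory.
    with open(path) as f:
        for line in f:
            for word in line.split():
                yield word

sample = iter_sample_fast(words('corpus.txt'), 1000)

Note that all the code in this post is Python 2; under Python 3, replace xrange with range, iterator.next() with next(iterator), and the print statements with print() calls.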
The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):
- iterSample: 5.05 s
- iter_sample_fast: 2.64 s