Python random sample generator (comfortable with huge population sizes)

Question

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end up sampling the entire population or shuffling it, which amounts to the same thing. That can be wasteful (if most sample sizes turn out to be small compared to the population size) or even infeasible (if the population is huge and you run out of memory). Also, what if your code needs to do other work between picking one element of the sample and the next?

P.S. I ran into the need to optimize random sampling while working on simulated annealing for TSP. In my code, sampling is restarted hundreds of thousands of times, and each time I don't know whether I will need to pick 1 element or 100% of the elements of the population.

Recommended answer

That's what generators are for, I believe. Here is an example of Fisher-Yates-Knuth sampling via a generator/yield: you get elements one by one and stop whenever you want.

Code updated

import random
import array

class populationFYK(object):
    """
    Lazy sampling without replacement via the Fisher-Yates-Knuth shuffle
    """
    def __init__(self, population):
        self._population = population        # reference to the population
        self._length     = len(population)   # length of the sequence
        self._index      = len(population)-1 # last unsampled index
        # Permutation of indices into the population. Typecode 'i' is a C int
        # (typically 32 bits); consider 'q' for very large populations.
        self._popidx     = array.array('i', range(0,self._length))

        # array module vs numpy (this alternative needs "import numpy")
        #self._popidx     = numpy.empty(self._length, dtype=numpy.int32)
        #for k in range(0,self._length):
        #    self._popidx[k] = k


    def swap(self, idx_a, idx_b):
        """
        Swap two entries of the index array (the population itself is untouched)
        """
        self._popidx[idx_a], self._popidx[idx_b] = self._popidx[idx_b], self._popidx[idx_a]

    def sample(self):
        """
        Generator: yield sampled elements one at a time, without repetition
        """
        while self._index >= 0:
            idx = random.randint(0, self._index) # index of the sampled event (bounds inclusive)

            if idx != self._index:
                self.swap(idx, self._index)      # move the pick to the end of the unsampled part

            sampled = self._population[self._popidx[self._index]] # look up the sampled element

            self._index -= 1 # one less to be sampled

            yield sampled

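    def index(self):
        """
        Return the last unsampled index (-1 once everything has been sampled)
        """
        return self._index

    def restart(self):
        """
        Reset the sampler so that the whole population is available again
        """
        self._index = self._length - 1
        for k in range(0,self._length):
            self._popidx[k] = k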

if __name__=="__main__":
    population = [1,3,6,8,9,3,2]

    gen = populationFYK(population)

    for k in gen.sample():
        print(k)
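
Note that the class above still allocates an index array as long as the population, so memory and restart cost both grow with the population size. If that matters, one possible variant (not part of the answer's code, just a sketch) is a sparse Fisher-Yates that stores only the positions that have actually been displaced, in a dict. The names lazy_sample, displaced and rng are made up for this illustration, and the population is assumed to support len() and integer indexing.

```python
import random
from itertools import islice

def lazy_sample(population, rng=random):
    """Yield the elements of `population` in random order, without repetition.

    Sparse Fisher-Yates sketch: only displaced positions are remembered, so
    memory grows with the number of items drawn, not with len(population).
    `rng` is anything with a randint(a, b) method (hypothetical parameter).
    """
    displaced = {}  # position -> index currently stored at that position
    for last in range(len(population) - 1, -1, -1):
        pick = rng.randint(0, last)          # both bounds inclusive
        chosen = displaced.get(pick, pick)   # index sitting at `pick`
        # Fill the freed slot with whatever index sits at `last`.
        displaced[pick] = displaced.get(last, last)
        displaced.pop(last, None)            # slot `last` is never read again
        yield population[chosen]

if __name__ == "__main__":
    gen = lazy_sample(range(10**12))   # a range is indexable and uses no memory
    first = next(gen)                  # take one element...
    # ...do unrelated work here, then come back for more
    more = list(islice(gen, 4))
    print(first, more)
```

Restarting here just means calling lazy_sample(population) again, which costs nothing up front. The trade-off is that each draw goes through dict lookups, so when most of the population ends up being sampled the array-based class above is probably the better fit.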
