Weighted random sample in Python

Question

I'm looking for a reasonable definition of a function weighted_sample that does not return just one random index for a list of given weights (which would be something like

import random

def weighted_choice(weights, random=random):
    """ Given a list of weights [w_0, w_1, ..., w_n-1],
        return an index i in range(n) with probability proportional to w_i. """
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        if w<0:
            raise ValueError("Negative weight encountered.")
        rnd -= w
        if rnd < 0:
            return i
    raise ValueError("Sum of weights is not positive")

to give a categorical distribution with constant weights) but a random sample of k of those, without replacement, just as random.sample behaves compared to random.choice.

Just as weighted_choice can be written as

lambda weights: random.choice([val for val, cnt in enumerate(weights)
    for i in range(cnt)])

weighted_sample could be written as

lambda weights, k: random.sample([val for val, cnt in enumerate(weights)
    for i in range(cnt)], k)

but I would like a solution that does not require me to unravel the weights into a (possibly huge) list.

Edit: If there are any nice algorithms that give me back a histogram/list of frequencies (in the same format as the argument weights) instead of a sequence of indices, that would also be very useful.

Recommended answer

From your code: ..

weight_sample_indexes = lambda weights, k: random.sample([val 
        for val, cnt in enumerate(weights) for i in range(cnt)], k)

.. I assume that weights are positive integers and by "without replacement" you mean without replacement for the unraveled sequence.

Here's a solution based on random.sample and O(log n) __getitem__:

import bisect
import random
from collections import Counter
from collections.abc import Sequence  # importing Sequence from collections fails on Python 3.10+

def weighted_sample(population, weights, k):
    return random.sample(WeightedPopulation(population, weights), k)

class WeightedPopulation(Sequence):
    def __init__(self, population, weights):
        assert len(population) == len(weights) > 0
        self.population = population
        self.cumweights = []
        cumsum = 0 # compute cumulative weight
        for w in weights:
            cumsum += w   
            self.cumweights.append(cumsum)  
    def __len__(self):
        return self.cumweights[-1]
    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.population[bisect.bisect(self.cumweights, i)]
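To see why `bisect.bisect` over the cumulative weights selects the right item, it may help to print the mapping for the weights used in the example. This is only a minimal sketch: the helper name `index_to_item` is made up for illustration and is not part of the answer's code.

```python
import bisect
from itertools import accumulate

population = "abc"
weights = [1, 10, 2]
cumweights = list(accumulate(weights))  # [1, 11, 13]

def index_to_item(i):
    """Map a virtual index in range(sum(weights)) to an item (illustrative helper)."""
    # bisect.bisect returns the position of the first cumulative weight > i.
    return population[bisect.bisect(cumweights, i)]

# The virtual indices 0..12 behave like the unraveled list,
# i.e. "a" + "b" * 10 + "c" * 2, without that list ever being stored.
virtual = "".join(index_to_item(i) for i in range(cumweights[-1]))
print(virtual)
```

Because `random.sample` only needs `__len__` and `__getitem__` on a sequence, sampling distinct virtual indices is equivalent to sampling without replacement from the unraveled list.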

Example

total = Counter()
for _ in range(1000):
    sample = weighted_sample("abc", [1,10,2], 5)
    total.update(sample)
print(sample)
print("Frequencies %s" % (dict(Counter(sample)),))

# Check that values are sane
print("Total " + ', '.join("%s: %.0f" % (val, count * 1.0 / min(total.values()))
                           for val, count in total.most_common()))

Output

['b', 'b', 'b', 'c', 'c']
Frequencies {'c': 2, 'b': 3}
Total b: 10, c: 2, a: 1
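The question's edit also asks for a histogram in the same format as the weights rather than a sequence of indices. One possible sketch, under the same assumption of non-negative integer weights, samples distinct virtual indices from a lazy `range` and counts them per bucket. The function name `weighted_sample_histogram` is hypothetical and not from the answer.

```python
import bisect
import random
from itertools import accumulate

def weighted_sample_histogram(weights, k):
    """Return a list h with len(h) == len(weights), where h[i] counts how
    often index i was drawn. Weights are assumed to be non-negative integers."""
    cumweights = list(accumulate(weights))
    total = cumweights[-1] if cumweights else 0
    if not 0 <= k <= total:
        raise ValueError("k must be between 0 and sum(weights)")
    histogram = [0] * len(weights)
    # range() is a lazy sequence, so nothing of size sum(weights) is built.
    for virtual_index in random.sample(range(total), k):
        histogram[bisect.bisect(cumweights, virtual_index)] += 1
    return histogram

print(weighted_sample_histogram([1, 10, 2], 5))
```

Each call returns counts that sum to k and never exceed the corresponding weight. The cost is O(k log n) after the O(n) cumulative sum, independent of the size of the weights themselves.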
