Probability distribution in Python


Question


I have a bunch of keys, each with an unlikeliness variable. I want to randomly choose one of these keys, yet I want more unlikely (key, value) pairs to be chosen less often than less unlikely (more likely) ones. I am wondering if you have any suggestions, preferably an existing Python module I could use; otherwise I will need to write it myself.

I have checked out the random module; it does not seem to provide this.

I have to make such choices many millions of times for 1000 different sets of objects, each containing 2,455 objects. Each set will exchange objects with the others, so the random chooser needs to be dynamic. With 1000 sets of 2,433 objects, that is about 2.4 million objects in total; low memory consumption is crucial. And since these choices are not the bulk of the algorithm, I need this process to be quite fast; CPU time is limited.

Thx

Update:

Ok, I tried to consider your suggestions wisely, but time is so limited...

I looked at the binary search tree approach and it seems too risky (complex and complicated). The other suggestions all resemble the ActiveState recipe. I took it and modified it a little in the hope of making it more efficient:

import random

def windex(dict, sum, max):
    '''an attempt to make a random.choose() function that makes
    weighted choices accepts a dictionary with the item_key and
    certainty_value as a pair like:
    >>> x = [('one', 20), ('two', 2), ('three', 50)], the
    maximum certainty value (max) and the sum of all certainties.'''
    n = random.uniform(0, 1)
    # total of the inverted weights: (max - certainty) summed over every item
    sum = max*len(dict)-sum
    for key, certainty in dict.iteritems():
        weight = float(max-certainty)/sum
        if n < weight:
            break
        n = n - weight
    return key

I am hoping to get an efficiency gain from dynamically maintaining the sum of certainties and the maximum certainty. Any further suggestions are welcome. You guys save me so much time and effort while increasing my effectiveness; it is crazy. Thx! Thx! Thx!
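
A minimal sketch of what that bookkeeping could look like (the class and method names are illustrative only; note that the maximum still needs an occasional rescan when the current maximum key is removed):

class CertaintyTotals:
    """Keeps a running sum and maximum of the certainties as keys come and
    go, so windex() never has to rescan the whole dictionary."""
    def __init__(self):
        self.certainties = {}
        self.total = 0
        self.maximum = 0
    def add(self, key, certainty):
        # assumes the key is new; replacing a key would first need to
        # subtract its old certainty
        self.certainties[key] = certainty
        self.total += certainty
        if certainty > self.maximum:
            self.maximum = certainty
    def remove(self, key):
        certainty = self.certainties.pop(key)
        self.total -= certainty
        if not self.certainties:
            self.maximum = 0
        elif certainty == self.maximum:
            # O(n) rescan, but only when the current maximum leaves the set
            self.maximum = max(self.certainties.values())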

Update2:

I decided to make it more efficient by letting it choose several keys at once. This will result in an acceptable loss of precision in my algo, since it is dynamic in nature. Anyway, here is what I have now:

def weightedChoices(dict, sum, max, choices=10):
    '''an attempt to make a random.choose() function that makes
    weighted choices accepts a dictionary with the item_key and
    certainty_value as a pair like:
    >>> x = [('one', 20), ('two', 2), ('three', 50)], the
    maximum certainty value (max) and the sum of all certainties.'''
    # several sorted uniform samples let one pass over the dict return
    # several keys at once
    list = sorted(random.uniform(0, 1) for i in range(choices))
    (n, list) = relavate(list)
    keys = []
    # total of the inverted weights, as in windex()
    sum = max*len(dict)-sum
    for key, certainty in dict.iteritems():
        weight = float(max-certainty)/sum
        if n < weight:
            keys.append(key)
            if list: (n, list) = relavate(list)
            else: break
        n = n - weight
    return keys
def relavate(list):
    # peel off the smallest remaining sample and shift the rest so each
    # threshold is measured relative to the previous one
    min = list[0]
    new = [l - min for l in list[1:]]
    return (min, new)

I haven't tried it out yet. If you have any comments/suggestions, please do not hesitate. Thx!
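
For reference, a rough sketch of how it would be called (illustrative values; the sum and the maximum would be maintained dynamically in the real code):

certainties = {'one': 20, 'two': 2, 'three': 50}
total = sum(certainties.values())    # 72
highest = max(certainties.values())  # 50
picked_keys = weightedChoices(certainties, total, highest, choices=10)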

Update3:

I have been working all day on a task-tailored version of Rex Logan's answer. Instead of 2 arrays of objects and weights, it is actually a special dictionary class, which makes things quite complex since Rex's code generates a random index... I also coded a test case that kind of resembles what will happen in my algo (but I can't really know until I try!). The basic principle is: the more often a key is randomly generated, the more unlikely it is to be generated again:

import random, time
import psyco
psyco.full()

class ProbDict():
    """
    Modified version of Rex Logan's RandomObject class. The more often a key is
    randomly chosen, the more unlikely it is to be randomly chosen again.
    """
    def __init__(self,keys_weights_values={}):
        self._kw=keys_weights_values
        self._keys=self._kw.keys()
        self._len=len(self._keys)
        self._findSeniors()
        self._effort = 0.15
        self._fails = 0
    def __iter__(self):
        return self.next()
    def __getitem__(self, key):
        return self._kw[key]
    def __setitem__(self, key, value):
        self.append(key, value)
    def __len__(self):
        return self._len
    def next(self):
        key=self._key()
        while key:
            yield key
            key = self._key()
    def __contains__(self, key):
        return key in self._kw
    def items(self):
        return self._kw.items()
    def pop(self, key):  
        try:
            (w, value) = self._kw.pop(key)
            self._len -=1
            if w == self._seniorW:
                self._seniors -= 1
                if not self._seniors:
                    #costly but unlikely:
                    self._findSeniors()
            return [w, value]
        except KeyError:
            return None
    def popitem(self):
        return self.pop(self._key())
    def values(self):
        values = []
        for key in self._keys:
            try:
                values.append(self._kw[key][1])
            except KeyError:
                pass
        return values
    def weights(self):
        weights = []
        for key in self._keys:
            try:
                weights.append(self._kw[key][0])
            except KeyError:
                pass
        return weights
    def keys(self, imperfect=False):
        if imperfect: return self._keys
        return self._kw.keys()
    def append(self, key, value=None):
        if key not in self._kw:
            self._len +=1
            self._kw[key] = [0, value]
            self._keys.append(key)
        else:
            self._kw[key][1]=value
    def _key(self):
        for i in range(int(self._effort*self._len)):
            ri=random.randint(0,self._len-1) #choose a random object
            rx=random.uniform(0,self._seniorW)
            rkey = self._keys[ri]
            try:
                w = self._kw[rkey][0]
                if rx >= w: # test to see if that is the value we want
                    w += 1
                    self._warnSeniors(w)
                    self._kw[rkey][0] = w
                    return rkey
            except KeyError:
                self._keys.pop(ri)
        # if no key was accepted within the sampling budget, fall back to the first valid key
        self._fails += 1 #for confirming effectiveness only
        for key in self._keys:
            if key in self._kw:
                w = self._kw[key][0] + 1
                self._warnSeniors(w)
                self._kw[key][0] = w
                return key
        return None
    def _findSeniors(self):
        '''this function finds the seniors, counts them and assesses their age. It
        is costly but unlikely.'''
        seniorW = 0
        seniors = 0
        for w, value in self._kw.itervalues():  # the values are [weight, value] pairs
            if w >= seniorW:
                if w == seniorW:
                    seniors += 1
                else:
                    seniorW = w
                    seniors = 1
        self._seniors = seniors
        self._seniorW = seniorW
    def _warnSeniors(self, w):
        #a weight can only be incremented...good
        if w >= self._seniorW:
            if w == self._seniorW:
                self._seniors+=1
            else:
                self._seniors = 1
                self._seniorW = w
def test():
    #test code
    iterations = 200000
    size = 2500
    nextkey = size 


    pd = ProbDict(dict([(i,[0,i]) for i in xrange(size)]))
    start = time.clock()
    for i in xrange(iterations):
        key=pd._key()
        w=pd[key][0]
        if random.randint(0,1+pd._seniorW-w):
            #the heavier the object, the more unlikely it will be removed
            pd.pop(key)
        probAppend = float(500+(size-len(pd)))/1000
        if random.uniform(0,1) < probAppend:
            nextkey+=1
            pd.append(nextkey)
    print (time.clock()-start)*1000/iterations, "msecs / iteration with", pd._fails, "failures /", iterations, "iterations"
    weights = pd.weights()
    weights.sort()
    print "avg weight:", float(sum(weights))/pd._len, max(weights), pd._seniorW, pd._seniors, len(pd), len(weights)
    print weights
test()

Any comments are still welcome. @Darius: your binary trees are too complex and complicated for me; and I do not think their leaves can be removed efficiently... Thx all

Solution

This ActiveState recipe gives an easy-to-follow approach, specifically the version in the comments that doesn't require you to pre-normalize your weights:

import random

def weighted_choice(items):
    """items is a list of tuples in the form (item, weight)"""
    weight_total = sum((item[1] for item in items))
    n = random.uniform(0, weight_total)
    for item, weight in items:
        if n < weight:
            return item
        n = n - weight
    return item
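
For instance, called on the tuple list from the question, it behaves like this (a quick illustration):

items = [('one', 20), ('two', 2), ('three', 50)]
# the total weight is 72, so 'three' comes back roughly 50/72 of the time
print(weighted_choice(items))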

This will be slow if you have a large list of items. A binary search would probably be better in that case... but would also be more complicated to write, for little gain if you have a small sample size. Here's an example of the binary search approach in Python if you want to follow that route.

(I'd recommend doing some quick performance testing of both methods on your dataset. The performance of different approaches to this sort of algorithm is often a bit unintuitive.)


Edit: I took my own advice, since I was curious, and did a few tests.

I compared four approaches:

1. The weighted_choice function above.

2. A binary-search choice function, like so (see the quick bisect check after this list):

import bisect

def weighted_choice_bisect(items):
    added_weights = []
    last_sum = 0

    for item, weight in items:
        last_sum += weight
        added_weights.append(last_sum)

    return items[bisect.bisect(added_weights, random.random() * last_sum)][0]

3. A compiling version of 1:

def weighted_choice_compile(items):
    """returns a function that fetches a random item from items

    items is a list of tuples in the form (item, weight)"""
    weight_total = sum((item[1] for item in items))
    def choice(uniform = random.uniform):
        n = uniform(0, weight_total)
        for item, weight in items:
            if n < weight:
                return item
            n = n - weight
        return item
    return choice

4. A compiling version of 2:

def weighted_choice_bisect_compile(items):
    """Returns a function that makes a weighted random choice from items."""
    added_weights = []
    last_sum = 0

    for item, weight in items:
        last_sum += weight
        added_weights.append(last_sum)

    def choice(rnd=random.random, bis=bisect.bisect):
        return items[bis(added_weights, rnd() * last_sum)][0]
    return choice
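
To make the cumulative-sum-plus-bisect step used in approaches 2 and 4 concrete, here is a quick sanity check with made-up numbers (not part of the benchmark):

import bisect

added_weights = [20, 22, 72]   # cumulative sums of the weights 20, 2 and 50
print(bisect.bisect(added_weights, 30.0))  # 2: a draw of 30.0 selects the third item
print(bisect.bisect(added_weights, 5.0))   # 0: a draw of 5.0 selects the first item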

I then built a big list of choices like so:

choices = [(random.choice("abcdefg"), random.uniform(0,50)) for i in xrange(2500)]

And an excessively simple profiling function:

import time

def profiler(f, n, *args, **kwargs):
    start = time.time()
    for i in xrange(n):
        f(*args, **kwargs)
    return time.time() - start

The results:

(Seconds taken for 1,000 calls to the function.)

  • Simple uncompiled: 0.918624162674
  • Binary uncompiled: 1.01497793198
  • Simple compiled: 0.287325024605
  • Binary compiled: 0.00327413797379

The "compiled" results include the average time taken to compile the choice function once. (I timed 1,000 compiles, then divided that time by 1,000, and added the result to the choice function time.)

So: if you have a list of items+weights which change very rarely, the binary compiled method is by far the fastest.
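
For completeness, the compiled bisect chooser would be used roughly like this (a sketch that assumes the definitions above; the chooser must be rebuilt whenever the items or their weights change):

import bisect, random

choices = [(random.choice("abcdefg"), random.uniform(0, 50)) for i in xrange(2500)]
chooser = weighted_choice_bisect_compile(choices)
samples = [chooser() for i in xrange(1000)]  # each call is a single bisect lookup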
