Probability distribution in Python


Problem Description

I have a bunch of keys that each have an unlikeliness variable. I want to randomly choose one of these keys, yet I want it to be more unlikely for an unlikely (key, value) pair to be chosen than for a less unlikely (more likely) one. I am wondering if you have any suggestions, preferably an existing Python module that I could use, else I will need to make one myself.

I have checked out the random module; it does not seem to provide this.
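(Worth noting for anyone reading this later: Python 3.6+ does provide exactly this in the standard library, as random.choices, which accepts per-item weights. A minimal sketch, assuming Python 3.6+:)

import random

# Python 3.6+ only: weights need not be normalized; higher = more likely
items = ['one', 'two', 'three']
weights = [20, 2, 50]
print(random.choices(items, weights=weights, k=1)[0])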

I have to make such choices many millions of times for 1,000 different sets of objects, each containing 2,455 objects. Each set will exchange objects among each other, so the random chooser needs to be dynamic. With 1,000 sets of 2,433 objects, that is 2,433,000 objects; low memory consumption is crucial. And since these choices are not the bulk of the algorithm, I need this process to be quite fast; CPU time is limited.

THX

UPDATE:

Ok, I tried to consider your suggestions wisely, but time is so limited...

I looked at the binary search tree approach and it seems too risky (complex and complicated). The other suggestions all resemble the ActiveState recipe. I took it and modified it a little in the hope of making it more efficient:

import random

def windex(d, total, max_w):
    '''An attempt at a weighted random.choice(): accepts a dict of
    item_key -> certainty_value pairs, e.g.
    {'one': 20, 'two': 2, 'three': 50},
    plus the maximum certainty value (max_w) and the sum of all
    certainties (total). The higher an item's certainty, the less
    likely it is to be chosen.'''
    n = random.uniform(0, 1)
    # invert the weights: each item's pick-weight is (max_w - certainty),
    # and the inverted weights sum to max_w*len(d) - total
    inv_total = max_w * len(d) - total
    for key, certainty in d.iteritems():
        weight = float(max_w - certainty) / inv_total
        if n < weight:
            break
        n = n - weight
    return key

I am hoping to get an efficiency gain from dynamically maintaining the sum of certainties and the maximum certainty. Any further suggestions are welcome. You guys save me so much time and effort, while increasing my effectiveness; it is crazy. Thx! Thx! Thx!
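For illustration, a minimal usage sketch of windex() as cleaned up above (the names certainties, total and max_w are mine; in practice the sum and maximum would be maintained incrementally as objects are exchanged, rather than recomputed):

certainties = {'one': 20, 'two': 2, 'three': 50}
total = sum(certainties.values())  # maintained incrementally in practice
max_w = max(certainties.values())  # likewise updated as items change
print windex(certainties, total, max_w)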

UPDATE2:

I decided to make it more efficient by letting it make several choices at once. This will result in an acceptable loss in precision in my algorithm, for it is dynamic in nature. Anyway, here is what I have now:

def weightedChoices(d, total, max_w, choices=10):
    '''Like windex() above, but draws several keys per call: accepts a
    dict of item_key -> certainty_value pairs, e.g.
    {'one': 20, 'two': 2, 'three': 50},
    plus the maximum certainty value (max_w) and the sum of all
    certainties (total). Returns up to `choices` keys.'''
    samples = [random.uniform(0, 1) for i in range(choices)]
    samples.sort()  # note: list.sort() sorts in place and returns None
    (n, samples) = relavate(samples)
    keys = []
    inv_total = max_w * len(d) - total
    for key, certainty in d.iteritems():
        weight = float(max_w - certainty) / inv_total
        if n < weight:
            keys.append(key)
            if samples:
                (n, samples) = relavate(samples)
            else:
                break
        n = n - weight
    return keys

def relavate(samples):
    '''Consume the smallest sample and re-base the remaining ones
    relative to it.'''
    smallest = samples[0]
    rest = [s - smallest for s in samples[1:]]
    return (smallest, rest)
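As a quick sanity check of the helper (integer inputs keep float repr noise out of the way):

>>> relavate([1, 4, 9])
(1, [3, 8])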

I haven't tried it out yet. If you have any comments/suggestions, please do not hesitate. Thx!

UPDATE3:

I have been working all day on a task-tailored version of Rex Logan's answer. Instead of two arrays of objects and weights, it is actually a special dictionary class, which makes things quite complex since Rex's code generates a random index... I also coded a test case that kind of resembles what will happen in my algorithm (but I can't really know until I try!). The basic principle is: the more often a key is randomly generated, the more unlikely it is to be generated again:

import random, time
import psyco
psyco.full()

class ProbDict():
    """
    Modified version of Rex Logan's RandomObject class. The more often a key is
    randomly chosen, the more unlikely it is to be chosen again.
    """
    def __init__(self, keys_weights_values=None):
        # use None as the default to avoid the shared mutable-default pitfall
        if keys_weights_values is None:
            keys_weights_values = {}
        self._kw = keys_weights_values
        self._keys=self._kw.keys()
        self._len=len(self._keys)
        self._findSeniors()
        self._effort = 0.15
        self._fails = 0
    def __iter__(self):
        return self.next()
    def __getitem__(self, key):
        return self._kw[key]
    def __setitem__(self, key, value):
        self.append(key, value)
    def __len__(self):
        return self._len
    def next(self):
        key = self._key()
        # keys may legitimately be 0, so test against None explicitly
        while key is not None:
            yield key
            key = self._key()
    def __contains__(self, key):
        return key in self._kw
    def items(self):
        return self._kw.items()
    def pop(self, key):  
        try:
            (w, value) = self._kw.pop(key)
            self._len -=1
            if w == self._seniorW:
                self._seniors -= 1
                if not self._seniors:
                    #costly but unlikely:
                    self._findSeniors()
            return [w, value]
        except KeyError:
            return None
    def popitem(self):
        return self.pop(self._key())
    def values(self):
        values = []
        for key in self._keys:
            try:
                values.append(self._kw[key][1])
            except KeyError:
                pass
        return values
    def weights(self):
        weights = []
        for key in self._keys:
            try:
                weights.append(self._kw[key][0])
            except KeyError:
                pass
        return weights
    def keys(self, imperfect=False):
        if imperfect: return self._keys
        return self._kw.keys()
    def append(self, key, value=None):
        if key not in self._kw:
            self._len +=1
            self._kw[key] = [0, value]
            self._keys.append(key)
        else:
            self._kw[key][1]=value
    def _key(self):
        for i in range(int(self._effort*self._len)):
            ri=random.randint(0,self._len-1) #choose a random object
            rx=random.uniform(0,self._seniorW)
            rkey = self._keys[ri]
            try:
                w = self._kw[rkey][0]
                if rx >= w: # test to see if that is the value we want
                    w += 1
                    self._warnSeniors(w)
                    self._kw[rkey][0] = w
                    return rkey
            except KeyError:
                self._keys.pop(ri)
        # if no candidate passes within int(effort*len) tries, fall back
        # to the first surviving key (and still age it)
        self._fails += 1 #for confirming effectiveness only
        for key in self._keys:
            if key in self._kw:
                w = self._kw[key][0] + 1
                self._warnSeniors(w)
                self._kw[key][0] = w
                return key
        return None
    def _findSeniors(self):
        '''Find the heaviest keys (the "seniors"), count them and record
        their weight. Costly, but rarely needed.'''
        seniorW = 0
        seniors = 0
        # each stored value is a [weight, value] pair
        for w, value in self._kw.itervalues():
            if w >= seniorW:
                if w == seniorW:
                    seniors += 1
                else:
                    seniorW = w
                    seniors = 1
        self._seniors = seniors
        self._seniorW = seniorW
    def _warnSeniors(self, w):
        #a weight can only be incremented...good
        if w >= self._seniorW:
            if w == self._seniorW:
                self._seniors+=1
            else:
                self._seniors = 1
                self._seniorW = w
def test():
    #test code
    iterations = 200000
    size = 2500
    nextkey = size
    pd = ProbDict(dict([(i,[0,i]) for i in xrange(size)]))
    start = time.clock()
    for i in xrange(iterations):
        key=pd._key()
        w=pd[key][0]
        if random.randint(0,1+pd._seniorW-w):
            #the heavier the object, the more unlikely it will be removed
            pd.pop(key)
        probAppend = float(500+(size-len(pd)))/1000
        if random.uniform(0,1) < probAppend:
            nextkey+=1
            pd.append(nextkey)
    print (time.clock()-start)*1000/iterations, "msecs / iteration with", pd._fails, "failures /", iterations, "iterations"
    weights = pd.weights()
    weights.sort()
    print "avg weight:", float(sum(weights))/pd._len, max(weights), pd._seniorW, pd._seniors, len(pd), len(weights)
    print weights
test()

Any comments are still welcome. @Darius: your binary trees are too complex and complicated for me, and I do not think their leaves can be removed efficiently... Thx all

Recommended Answer

This ActiveState recipe gives an easy-to-follow approach; specifically, the version in the comments, which doesn't require you to pre-normalize your weights:

import random

def weighted_choice(items):
    """items is a list of tuples in the form (item, weight)"""
    weight_total = sum(item[1] for item in items)
    n = random.uniform(0, weight_total)
    for item, weight in items:
        if n < weight:
            return item
        n = n - weight
    # float rounding can leave n just past the final weight;
    # return the last item in that case
    return item
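For instance, with the (item, weight) pairs from the question:

pairs = [('one', 20), ('two', 2), ('three', 50)]
print weighted_choice(pairs)   # 'three' roughly 50/72 of the time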

This will be slow if you have a large list of items. A binary search would probably be better in that case... but would also be more complicated to write, for little gain if you have a small sample size. Here's an example of the binary search approach in Python if you want to follow that route.

(I'd recommend doing some quick performance testing of both methods on your dataset. The performance of different approaches to this sort of algorithm is often a bit unintuitive.)


EDIT: I took my own advice, since I was curious, and did a few tests.

I compared four approaches:

1. The weighted_choice function above.

2. A binary-search choice function, like so:

import bisect

def weighted_choice_bisect(items):
    # build the running cumulative sum of the weights
    added_weights = []
    last_sum = 0

    for item, weight in items:
        last_sum += weight
        added_weights.append(last_sum)

    # bisect finds where a uniform draw lands among the cumulative sums
    return items[bisect.bisect(added_weights, random.random() * last_sum)][0]
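To see why this works: for items = [('a', 1), ('b', 3)] the cumulative sums are [1, 4]; a uniform draw scaled to [0, 4) bisects to index 0 (picking 'a') when it falls in [0, 1) and to index 1 (picking 'b') when it falls in [1, 4), so each item is chosen in proportion to its weight.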

3. A "compiled" version of 1:

def weighted_choice_compile(items):
    """returns a function that fetches a random item from items

    items is a list of tuples in the form (item, weight)"""
    # weight_total is computed once and captured in the closure
    weight_total = sum(item[1] for item in items)
    def choice(uniform=random.uniform):
        n = uniform(0, weight_total)
        for item, weight in items:
            if n < weight:
                return item
            n = n - weight
        return item
    return choice

4. A "compiled" version of 2:

def weighted_choice_bisect_compile(items):
    """Returns a function that makes a weighted random choice from items."""
    added_weights = []
    last_sum = 0

    for item, weight in items:
        last_sum += weight
        added_weights.append(last_sum)

    def choice(rnd=random.random, bis=bisect.bisect):
        return items[bis(added_weights, rnd() * last_sum)][0]
    return choice
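The compiled variants are used in two steps: build once, then call the returned function many times. The function must be rebuilt whenever the weights change:

pairs = [('one', 20), ('two', 2), ('three', 50)]
pick = weighted_choice_bisect_compile(pairs)
print pick()   # cheap per call; rebuild `pick` if the weights change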

I then built a big list of choices like so:

choices = [(random.choice("abcdefg"), random.uniform(0,50)) for i in xrange(2500)]

And an excessively simple profiling function:

import time

def profiler(f, n, *args, **kwargs):
    start = time.time()
    for i in xrange(n):
        f(*args, **kwargs)
    return time.time() - start
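The timings below were presumably collected along these lines (a sketch; the exact harness is not shown in the answer):

print profiler(weighted_choice, 1000, choices)
print profiler(weighted_choice_bisect, 1000, choices)
compiled = weighted_choice_bisect_compile(choices)
print profiler(compiled, 1000)   # plus the amortized compile time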

Results:

(Seconds taken for 1,000 calls to the function.)

  • Simple uncompiled: 0.918624162674
  • Binary uncompiled: 1.01497793198
  • Simple compiled: 0.287325024605
  • Binary compiled: 0.00327413797379

The "compiled" results include the average time taken to compile the choice function once. (I timed 1,000 compiles, then divided that time by 1,000, and added the result to the choice-function time.)

So: if you have a list of items and weights which change very rarely, the binary compiled method is by far the fastest.
