从python中的权重数组获取随机索引的快速方法 [英] Fast way to obtain a random index from an array of weights in python

查看:320
本文介绍了从python中的权重数组获取随机索引的快速方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我经常发现自己需要一个随机的数组或列表索引,其中索引的概率不是均匀分布的,而是根据一定的正权重。获得它们的快速方法是什么?我知道我可以将权重作为可选参数 p 传递给 numpy.random.choice ,但是该功能似乎很慢,并且建立 range 传递它也不理想。权重之和可以是任意正数,并且不能保证为1,这使得生成(0,1]中的随机数然后减去权重项直到结果为0或更小的方法都是不可能的。

I regularly find myself in the position of needing a random index to an array or a list, where the probabilities of indices are not uniformly distributed, but according to certain positive weights. What's a fast way to obtain them? I know I can pass weights to numpy.random.choice as optional argument p, but the function seems quite slow, and building an arange to pass it is not ideal either. The sum of weights can be an arbitrary positive number and is not guaranteed to be 1, which makes the approach to generate a random number in (0,1] and then substracting weight entries until the result is 0 or less impossible.

虽然有一些关于如何以简单的方式(例如,加权选择简短而又,我正在寻找一种快速的解决方案,因为适当的功能经常执行。权重经常变化,因此构建别名掩码之类的开销(有关详细介绍,请参见 http://www.keithschwarz.com/darts-dice-coins/ )应视为计算时间的一部分。

While there are answers on how to implement similar things (mostly not about obtaining the array index, but the corresponding element) in a simple manner, such as Weighted choice short and simple, I'm looking for a fast solution, because the appropriate function is executed very often. My weights change frequently, so the overhead of building something like an alias mask (a detailed introduction can be found on http://www.keithschwarz.com/darts-dice-coins/) should be considered part of the calculation time.

推荐答案

累积求和和bisect



在任何通用情况下,似乎建议计算权重的累加和,并使用bisect模块中的bisect在所得排序数组中查找随机点

Cumulative summing and bisect

In any generic case, it seems advisable to calculate the cumulative sum of weights, and use bisect from the bisect module to find a random point in the resulting sorted array

def weighted_choice(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])

如果速度是a关心。下面提供了更详细的分析。

if speed is a concern. A more detailed analysis is given below.

注意:如果数组不是平坦的,则 numpy.unravel_index 可用于将平面索引转换为形状索引,如 https://stackoverflow.com/a / 19760118/1274613

Note: If the array is not flat, numpy.unravel_index can be used to transform a flat index into a shaped index, as seen in https://stackoverflow.com/a/19760118/1274613

使用 numpy 内置函数。使用 timeit 进行比较,得出以下结果:

There are four more or less obvious solutions using numpy builtin functions. Comparing all of them using timeit gives the following result:

import timeit

weighted_choice_functions = [
"""import numpy
wc = lambda weights: numpy.random.choice(
    range(len(weights)),
    p=weights/weights.sum())
""",
"""import numpy
# Adapted from https://stackoverflow.com/a/19760118/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return cs.searchsorted(numpy.random.random() * cs[-1], 'right')
""",
"""import numpy, bisect
# Using bisect mentioned in https://stackoverflow.com/a/13052108/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
""",
"""import numpy
wc = lambda weights: numpy.random.multinomial(
    1,
    weights/weights.sum()).argmax()
"""]

for setup in weighted_choice_functions:
    for ps in ["numpy.ones(40)",
               "numpy.arange(10)",
               "numpy.arange(200)",
               "numpy.arange(199,-1,-1)",
               "numpy.arange(4000)"]:
        timeit.timeit("wc(%s)"%ps, setup=setup)
    print()

结果输出为

178.45797914802097
161.72161589498864
223.53492237901082
224.80936180002755
1901.6298267539823

15.197789980040397
19.985687876993325
20.795070077001583
20.919113760988694
41.6509403079981

14.240949985047337
17.335801470966544
19.433710905024782
19.52205040602712
35.60536142199999

26.6195822560112
20.501282756973524
31.271995796996634
27.20013752405066
243.09768892999273

这意味着 numpy.random.choice 是令人惊讶的非常慢,甚至专用的numpy searchsorted 方法也比类型慢-naive bisect 变体。 (这些结果是使用numpy 1.8.1和Python 3.3.5获得的,因此其他版本可能有所不同。)基于 numpy.random.multinomial 的函数较少对于大的权重,比基于累积求和的方法更有效。大概是argmax必须遍历整个数组并进行比较的事实在每个步骤中都起着重要的作用,从增加和减少权重列表之间的四秒差异也可以看出这一点。

This means that numpy.random.choice is surprisingly very slow, and even the dedicated numpy searchsorted method is slower than the type-naive bisect variant. (These results were obtained using Python 3.3.5 with numpy 1.8.1, so things may be different for other versions.) The function based on numpy.random.multinomial is less efficient for large weights than the methods based on cumulative summing. Presumably the fact that argmax has to iterate over the whole array and run comparisons each step plays a significant role, as can be seen as well from the four second difference between an increasing and a decreasing weight list.

这篇关于从python中的权重数组获取随机索引的快速方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆