根据权重分布从列表中随机选择N个项目的最快算法是什么? [英] What would be the fastest algorithm to randomly select N items from a list based on weights distribution?

查看:71
本文介绍了根据权重分布从列表中随机选择N个项目的最快算法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有很多物品,每个物品都有重量.

I have a large list of items, each item has a weight.

我想随机选择N个物品而不替换它们,而重量更大的物品则更有可能被选择.

I'd like to select N items randomly without replacement, while the items with more weight are more probable to be selected.

我正在寻找效果最好的主意.性能至上.有什么想法吗?

I'm looking for the most performing idea. Performance is paramount. Any ideas?

推荐答案

如果要样本无需更换的物品,您有很多选择.

If you want to sample items without replacement, you have lots of options.

  • 使用加权选择替换算法选择随机索引.有许多类似的算法.其中一个是 WeightedChoice ,将在此答案后面进行描述,另一个是拒绝采样,如下所述.假设最大权重是 max ,并且有 n 个权重.要使用拒绝采样在[0, n )中选择索引,

  • Use a weighted-choice-with-replacement algorithm to choose random indices. There are many algorithms like this. One of them is WeightedChoice, described later in this answer, and another is rejection sampling, described as follows. Assume that the highest weight is max and there are n weights. To choose an index in [0, n) using rejection sampling:

  1. 在[0, n )中选择一个统一的随机整数 i .
  2. 以概率 weights [i]/max ,返回 i .否则,请转到步骤1.
  1. Choose a uniform random integer i in [0, n).
  2. With probability weights[i]/max, return i. Otherwise, go to step 1.

每次加权选择算法选择一个索引时,请将所选索引的权重设置为0,以防止再次选择它.或者...

Each time the weighted choice algorithm chooses an index, set the weight for the chosen index to 0 to keep it from being chosen again. Or...

为每个索引分配一个指数分布的随机数(比率等于该索引的权重),创建一个将每个数字分配给索引的对的列表,然后按这些数字对列表进行排序.然后从头到尾按照升序排列每个项目.可以使用优先级队列数据结构在线完成排序(该技术可导致

Assign each index an exponentially distributed random number (with a rate equal to that index's weight), make a list of pairs assigning each number to an index, then sort that list by those numbers. Then take each item from first to last, in ascending order. This sorting can be done on-line using a priority queue data structure (a technique that leads to weighted reservoir sampling). Notice that the naïve way to generate the random number, -ln(1-RNDU01())/weight, is not robust, however ("Index of Non-Uniform Distributions", under "Exponential distribution").

蒂姆·维埃拉(Tim Vieira)提供了

Tim Vieira gives additional options in his blog.

A 纸张比较了各种算法

编辑(8月19日):请注意,对于这些解决方案,权重表示给定项目在样本中 first 出现的可能性.此权重不一定是 n 个项目的给定样本包含该项目的机会(即,一个 inclusion概率).上面给出的方法并不一定要确保给定项目出现在随机样本中的概率与它的权重成正比;为此,请参见"概率相等或不相等的采样算法".

EDIT (Aug. 19): Note that for these solutions, the weight expresses how likely a given item will appear first in the sample. This weight is not necessarily the chance that a given sample of n items will include that item (that is, an inclusion probability). The methods given above will not necessarily ensure that a given item will appear in a random sample with probability proportional to its weight; for that, see "Algorithms of sampling with equal or unequal probabilities".

假设您要随机选择要替换的项目,以下是实现这种选择的伪代码.给定一个权重列表,它将返回随机索引(从0开始),该索引的选择与权重成正比.该算法是实现加权选择的直接方法.但是,如果您觉得它太慢,请参阅我的" 可替换的加权选择";进行其他算法的研究.

Assuming you want to choose items at random with replacement, here is pseudocode implementing this kind of choice. Given a list of weights, it returns a random index (starting at 0), chosen with a probability proportional to its weight. This algorithm is a straightforward way to implement weighted choice. But if it's too slow for you, see my section "Weighted Choice With Replacement" for a survey of other algorithms.

METHOD WChoose(weights, value)
    // Choose the index according to the given value
    lastItem = size(weights) - 1
    runningValue = 0
    for i in 0...size(weights) - 1
       if weights[i] > 0
          newValue = runningValue + weights[i]
          lastItem = i
          // NOTE: Includes start, excludes end
          if value < newValue: break
          runningValue = newValue
       end
    end
    // If we didn't break above, this is a last
    // resort (might happen because rounding
    // error happened somehow)
    return lastItem
END METHOD

METHOD WeightedChoice(weights)
    return WChoose(weights, RNDINTEXC(Sum(weights)))
END METHOD

这篇关于根据权重分布从列表中随机选择N个项目的最快算法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆