带替换和不带替换的加权随机选择 [英] Weighted random selection with and without replacement

查看:19
本文介绍了带替换和不带替换的加权随机选择的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

最近我需要从列表中对元素进行加权随机选择,包括替换和不替换.虽然有用于未加权选择的众所周知的好算法,以及一些用于无替换加权选择的算法(例如 resevoir 算法的修改),但我找不到任何用于带替换加权选择的好算法.我还想避免使用 resevoir 方法,因为我选择了列表的很大一部分,它小到足以保存在内存中.

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the resevoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the resevoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.

有人对这种情况下的最佳方法有什么建议吗?我有自己的解决方案,但我希望找到更高效、更简单的方法,或者两者兼而有之.

Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

推荐答案

使用来自不变列表的替换样本制作多个样本的最快方法之一是别名方法.核心直觉是我们可以为加权列表创建一组大小相等的 bin,可以通过位操作非常有效地索引,以避免二分搜索.事实证明,如果操作正确,我们将只需要在每个 bin 中存储来自原始列表的两个项目,因此可以用单个百分比表示拆分.

One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.

让我们以五个同等权重的选择为例,(a:1, b:1, c:1, d:1, e:1)

Let's us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)

要创建别名查找:

  1. 标准化权重,使它们的总和为 1.0.(a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) 这是选择每个权重的概率.

  1. Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.

找出大于或等于变量数量的最小2的幂,并创建这个数量的分区,|p|.每个分区代表 1/|p| 的概率质量.在本例中,我们创建 8 个分区,每个分区可以包含 0.125.

Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.

取剩余重量最小的变量,并将尽可能多的质量放在一个空的分区中.在这个例子中,我们看到 a 填充了第一个分区.(p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8)(a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)

Take the variable with the least remaining weight, and place as much of it's mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)

如果分区没有填充,取权重最大的变量,用那个变量填充分区.

If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.

重复第 3 步和第 4 步,直到不需要将原始分区的权重分配给列表.

Repeat steps 3 and 4, until none of the weight from the original partition need be assigned to the list.

例如,如果我们再次运行 3 和 4 的迭代,我们会看到

For example, if we run another iteration of 3 and 4, we see

(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8)(a:0,b:0.15 c:0.2 d:0.2 e:0.2) 留待分配

运行时:

  1. 得到一个U(0,1)随机数,比如二进制0.001100000

bitshift it lg2(p),找到索引分区.因此,我们将它移动 3,产生 001.1,或位置 1,因此分区 2.

bitshift it lg2(p), finding the index partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.

如果分区被拆分,则使用移位后的随机数的小数部分来决定拆分.在这种情况下,值为 0.5,并且 0.5 <0.6,所以返回a.

If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.

这里有一些代码和另一种解释,但是不幸的是,它没有使用位移技术,我也没有实际验证过.

Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.

这篇关于带替换和不带替换的加权随机选择的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆