Efficient algorithm to randomly select items with frequency
Question
Given an array of n word-frequency pairs:

[ (w_0, f_0), (w_1, f_1), ..., (w_{n-1}, f_{n-1}) ]

where w_i is a word, f_i is an integer frequency, and the sum of the frequencies is ∑ f_i = m,
I want to use a pseudo-random number generator (pRNG) to select p words w_{j_0}, w_{j_1}, ..., w_{j_{p-1}} such that the probability of selecting any word is proportional to its frequency:

P(w_i = w_{j_k}) = P(i = j_k) = f_i / m
(Note that this is selection with replacement, so the same word can be chosen more than once.)

I have come up with three algorithms so far:
- Create an array of size m, and populate it so the first f_0 entries are w_0, the next f_1 entries are w_1, and so on, so the last f_{n-1} entries are w_{n-1}:

  [ w_0, ..., w_0, w_1, ..., w_1, ..., w_{n-1}, ..., w_{n-1} ]
  Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices. This takes O(n + m + p) work, which isn't great, since m can be much larger than n.
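A rough sketch of this first approach (the method name is mine, not from the question):

```ruby
# Naive weighted sampling with replacement: expand each word into
# freq copies, then index the expanded array uniformly at random.
# O(n + m + p) work, but O(m) memory.
def sample_by_expansion(pairs, p)
  expanded = pairs.flat_map { |word, freq| [word] * freq }
  m = expanded.size
  (0...p).map { expanded[rand(m)] }
end

# e.g. sample_by_expansion([["a", 1], ["b", 2], ["c", 3]], 5)
# draws 5 words, "c" roughly three times as often as "a"
```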
- Step through the input array once, computing the running sums

  m_i = ∑_{h≤i} f_h = m_{i-1} + f_i

  and after computing each m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1, and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i. This requires O(n + np) work.
- As in (2), compute the running sums m_i, but store them with the pairs as triples:

  [ (w_0, f_0, m_0), (w_1, f_1, m_1), ..., (w_{n-1}, f_{n-1}, m_{n-1}) ]

  Then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, do a binary search on the array of triples to find the i s.t. m_i - f_i ≤ x_k < m_i, and select w_i for w_{j_k}. This requires O(n + p log n) work.
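A sketch of the binary-search variant, using Ruby's built-in `Array#bsearch` in find-minimum mode (the method name is mine):

```ruby
# Precompute cumulative sums as (word, freq, m_i) triples in O(n),
# then answer each draw with a binary search over the m_i in
# O(log n), for O(n + p log n) total.
def sample_by_bsearch(pairs, p)
  m = 0
  triples = pairs.map { |word, freq| [word, freq, m += freq] }
  (0...p).map do
    x = rand(m)
    # first triple with m_i > x, i.e. m_i - f_i <= x < m_i
    word, _freq, _m_i = triples.bsearch { |(_, _, m_i)| x < m_i }
    word
  end
end
```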
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
Answer
Ok, I found another algorithm: the alias method (also mentioned in this answer: http://stackoverflow.com/questions/352670/weighted-random-selection-with-and-without-replacement/353576#353576). Basically it creates a partition of the probability space such that:
- There are n partitions, all of the same width r s.t. nr = m.
- Each partition contains two words in some ratio (which is stored with the partition).
- For each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i)
Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.
Given such a pair w_i and w_i', we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i), respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.
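To make the merging concrete, here is a small worked example (the frequencies are my own illustration, not from the question): take f = {a: 1, b: 2, c: 6}, so m = 9, n = 3, and r = 3.

```ruby
# Pair a (f=1 < r) with c (f=6 > r): partition 1 represents a with
# ratio 1/3 and c with ratio 2/3; c's frequency drops to 6-(3-1)=4.
# Pair b (f=2 < r) with c (f=4 > r): partition 2 represents b with
# ratio 2/3 and c with ratio 1/3; c drops to 4-(3-2)=3 = r, so it
# becomes the one-word partition 3 on its own.
r = Rational(3)
partitions = [
  { a: Rational(1, 3), c: Rational(2, 3) },
  { b: Rational(2, 3), c: Rational(1, 3) },
  { c: Rational(1) },
]

# Summing r × ratio(t, w_i) over the partitions recovers each
# original frequency, as the partition property requires.
freq = Hash.new(Rational(0))
partitions.each do |part|
  part.each { |word, ratio| freq[word] += r * ratio }
end
freq  # recovers a => 1, b => 2, c => 6
```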
To construct the partition in O(n) time:

- Go through the list of words once, constructing two lists:
  - one of words with frequency ≤ r
  - one of words with frequency > r
- Then, repeatedly take a word from the first list:
  - if its frequency = r, turn it into a one-element partition
  - otherwise, pull a word from the other list and use it to fill out a two-word partition. Then put that second word back into the first or second list according to its adjusted frequency.
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'_i = n·f_i, which updates m' = mn and sets r' = m when q = n.

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In Ruby:
```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }

  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word, freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [word, freq * n]
  end.partition do |word, adj_freq|
    adj_freq <= m
  end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
      other_word
    end

    [word, other_word, adj_freq]
  end

  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[rand(n)]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
```