Efficient algorithm to randomly select items with frequency
Question
Given an array of n word-frequency pairs:

[ (w_0, f_0), (w_1, f_1), ..., (w_{n-1}, f_{n-1}) ]

where w_i is a word, f_i is an integer frequency, and the sum of the frequencies is ∑ f_i = m,
I want to use a pseudo-random number generator (pRNG) to select p words w_{j_0}, w_{j_1}, ..., w_{j_{p-1}} such that the probability of selecting any word is proportional to its frequency:

P(w_i = w_{j_k}) = P(i = j_k) = f_i / m
(Note that this is selection with replacement, so the same word can be chosen more than once.)

I have come up with three algorithms so far:
- Create an array of size m, and populate it so the first f_0 entries are w_0, the next f_1 entries are w_1, and so on, so the last f_{n-1} entries are w_{n-1}:

  [ w_0, ..., w_0, w_1, ..., w_1, ..., w_{n-1}, ..., w_{n-1} ]
  Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices. This takes O(n + m + p) work, which isn't great, since m can be much larger than n.
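A rough sketch of this first approach (the method name is mine, not from the question):

```ruby
# Naive weighted sampling with replacement: expand each word into
# freq copies, then index the expanded array uniformly at random.
# O(n + m + p) work, but O(m) memory.
def sample_by_expansion(pairs, p)
  expanded = pairs.flat_map { |word, freq| [word] * freq }
  m = expanded.size
  (0...p).map { expanded[rand(m)] }
end

# e.g. sample_by_expansion([["a", 1], ["b", 2], ["c", 3]], 5)
# draws 5 words, "c" roughly three times as often as "a"
```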
- Step through the input array once, computing the running sums

  m_i = ∑_{h≤i} f_h = m_{i-1} + f_i

  and after computing each m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1, and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i. This requires O(n + np) work.
- As in (2), compute the running sums m_i, but store them with the pairs as triples:

  [ (w_0, f_0, m_0), (w_1, f_1, m_1), ..., (w_{n-1}, f_{n-1}, m_{n-1}) ]

  Then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, do a binary search on the array of triples to find the i s.t. m_i - f_i ≤ x_k < m_i, and select w_i for w_{j_k}. This requires O(n + p log n) work.
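A sketch of the binary-search variant, using Ruby's built-in `Array#bsearch` in find-minimum mode (the method name is mine):

```ruby
# Precompute cumulative sums as (word, freq, m_i) triples in O(n),
# then answer each draw with a binary search over the m_i in
# O(log n), for O(n + p log n) total.
def sample_by_bsearch(pairs, p)
  m = 0
  triples = pairs.map { |word, freq| [word, freq, m += freq] }
  (0...p).map do
    x = rand(m)
    # first triple with m_i > x, i.e. m_i - f_i <= x < m_i
    word, _freq, _m_i = triples.bsearch { |(_, _, m_i)| x < m_i }
    word
  end
end
```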
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
Answer
Ok, I found another algorithm: the alias method (also mentioned in this answer: http://stackoverflow.com/questions/352670/weighted-random-selection-with-and-without-replacement/353576#353576). Basically it creates a partition of the probability space such that:
- There are n partitions, all of the same width r s.t. nr = m.
- Each partition contains two words in some ratio (which is stored with the partition).
- For each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i)
Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.
Given such a pair w_i and w_i', we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i), respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.
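To make the merging concrete, here is a small worked example (the frequencies are my own illustration, not from the question): take f = {a: 1, b: 2, c: 6}, so m = 9, n = 3, and r = 3.

```ruby
# Pair a (f=1 < r) with c (f=6 > r): partition 1 represents a with
# ratio 1/3 and c with ratio 2/3; c's frequency drops to 6-(3-1)=4.
# Pair b (f=2 < r) with c (f=4 > r): partition 2 represents b with
# ratio 2/3 and c with ratio 1/3; c drops to 4-(3-2)=3 = r, so it
# becomes the one-word partition 3 on its own.
r = Rational(3)
partitions = [
  { a: Rational(1, 3), c: Rational(2, 3) },
  { b: Rational(2, 3), c: Rational(1, 3) },
  { c: Rational(1) },
]

# Summing r × ratio(t, w_i) over the partitions recovers each
# original frequency, as the partition property requires.
freq = Hash.new(Rational(0))
partitions.each do |part|
  part.each { |word, ratio| freq[word] += r * ratio }
end
freq  # recovers a => 1, b => 2, c => 6
```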
To construct the partition in O(n) time:

- Go through the list of words once, constructing two lists:
  - one of words with frequency ≤ r
  - one of words with frequency > r
- Then, repeatedly take a word from the first list:
  - if its frequency = r, turn it into a one-element partition
  - otherwise, pull a word from the other list and use it to fill out a two-word partition. Then put that second word back into the first or second list according to its adjusted frequency.
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'_i = n·f_i, which updates m' = mn and sets r' = m when q = n.

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In Ruby:
```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }

  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word, freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [word, freq * n]
  end.partition do |word, adj_freq|
    adj_freq <= m
  end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
      other_word
    end

    [word, other_word, adj_freq]
  end

  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[rand(n)]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
```