从N个集合中随机选择n个记录 [英] Select n records at random from a set of N
问题描述
我需要从一组 N
(其中个)中随机选择
)。 n
个记录0< n< N
I need to select n
records at random from a set of N
(where 0 < n < N
).
可能的算法是:
遍历列表,并为每个元素选择概率为
(所需数量)/(剩余数量)
因此,如果您有40个项目,则第一个被选中的可能性为 5/40
。
So if you had 40 items, the first would have a 5/40
chance of being selected.
如果是,则下一个概率为 4/39
,否则,概率为 5/39 $机会。到最后
时,您将拥有5件商品,并且在此之前通常都拥有它们。
If it is, the next has a 4/39
chance, otherwise it has a 5/39
chance. By the time you get to the end
you will have your 5 items, and often you'll have all of them before that.
假设一个好的伪随机数生成器,这个算法正确吗?
Assuming a good pseudo-random number generator, is this algorithm correct?
注意
关于stackoverflow的此类问题很多(很多问题被标记为从C#中的List
There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).
经常提出上述算法(例如 Kyle Cronin 的答案)和
总是受到质疑(例如,请参阅
此处,此处,此处,这里 ...)。
The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).
我可以对这件事有最后的决定吗?
Can I have a final word about the matter?
推荐答案
算法绝对正确。
这不是好的海报的突然发明,而是一种称为选择抽样 / 算法S 的众所周知的技术(由Fan,Muller和Rezucha发现(1)并由Jones(2)在1962年独立提出),在 TAOCP -第2卷-Seminumerical算法-第3.4.2。节
It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.
正如努斯所说:
乍一看似乎是不可靠的,实际上是不正确的。但是仔细分析表明它是完全值得信赖的。
This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.
算法对 n
N 和 t + 1
st的一组元素中选择>元素的概率为(n-m)/(N-t)
,当已经选择 m
个元素时。
The algorithm samples n
elements from a set of size N
and the t + 1
st element is chosen with probability (n - m) / (N - t)
, when already m
elements have been chosen.
很容易看出,在选择 n
个项目之前,我们从来没有跑过集的末尾(因为概率为 1
,当我们有 k
个元素要从其余的 k
个元素中选择时)。
It's easy to see that we never run off the end of the set before choosing n
items (as the probability will be 1
when we have k
elements to choose from the remaining k
elements).
我们也永远不会选择太多元素( n == m的概率将是
)。 0
Also we never pick too many elements (the probability will be 0
as soon n == m
).
要证明样本是完全无偏的要难一点,但这是
It's a bit harder to demonstrate that the sample is completely unbiased, but it's
...正确,尽管我们不是选择
t + 1
st概率为n / N
的项目。
... true in spite of the fact that we are not selecting the
t + 1
st item with probabilityn / N
. This has caused some confusion in the published literature
(所以不仅仅是在Stackoverflow上)
(so not just on Stackoverflow!)
事实是我们不应该混淆条件概率和无条件概率:
The fact is we should not confuse conditional and unconditional probabilities:
例如,考虑第二个元素;如果在样本中选择了第一个元素(发生概率为
n / N
),则选择了第二个元素的概率为(n-1) /(N-1)
;如果未选择第一个元素,则选择第二个元素的概率为n /(N-1)
。
For example consider the second element; if the first element was selected in the sample (this happen with probability
n / N
), the second element is selected with probability(n - 1) / (N - 1)
; if the first element was not selected, the second element is selected with probabilityn / (N - 1)
.
选择第二个元素的总概率为(n / N)((n-1)/(N-1))+(1-n / N)(n /(N-1) )= n / N
。
The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N
.
TAOCP-第2卷-第3.4节。 2练习3
除了理论上的考虑之外, Algorithm S (和 algorithm R > / 水库采样 )已在许多知名图书馆中使用(例如 SGI的原始STL实现, std :: experimental :: sample
,
random.sample
在P中ython ...)。
Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample
,
random.sample
in Python...).
当然,算法S并非总是最佳答案:
Of course algorithm S is not always the best answer:
- 它是
O(N)
(即使我们通常不必跳过所有N
条记录:n = 2
大约为2/3 N
时考虑的平均记录数;常规公式在
TAOCP-第2卷-第3.4.2节-ex 5/6中给出; - 不能使用提前不知道
N
的值。
- it's
O(N)
(even if we will usually not have to pass over allN
records: the average number of records considered whenn=2
is about2/3 N
; the general formulas are given in TAOCP - Vol 2 - § 3.4.2 - ex 5/6); - it cannot be used when the value of
N
isn't known in advance.
无论如何!
- C。 T. Fan,M。E. Muller和I. Rezucha,J。Amer。统计副会长57(1962),第387-402页
- T。 G.Jones,CACM 5(1962),第343页
编辑
您如何随机选择此项,概率为7/22
how do you randomly select this item, with a probability of 7/22
[CUT]
在极少数情况下,当您需要5个元素时甚至可以选择4个或6个元素
In rare cases, you might even pick 4 or 6 elements when you wanted 5
这来自 N3925 (为避免通用接口/标签分发而作的小修改):
This is from N3925 (small modifications to avoid the common interface / tag dispatch):
template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
using dist_t = uniform_int_distribution<Size>;
using param_t = typename dist_t::param_type;
dist_t d{};
Size unsampled_sz = distance(first, last);
for (n = min(n, unsampled_sz); n != 0; ++first)
{
param_t const p{0, --unsampled_sz};
if (d(g, p) < n) { *out++ = *first; --n; }
}
return out;
}
没有浮点数。
- 如果需要5个元素,您将获得5个元素;
- 如果
uniform_int_distribution
如广告所述,没有偏见。
- If you need 5 elements you get 5 elements;
- if
uniform_int_distribution
"works as advertised" there is no bias.
这篇关于从N个集合中随机选择n个记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!