从N个集合中随机选择n个记录 [英] Select n records at random from a set of N

查看：171 发布时间：2020/6/3 19:59:10 algorithm random-sample correctness

本文介绍了从N个集合中随机选择n个记录的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我需要从一组 N （其中个）中随机选择 n 个记录0< n< N ）。

I need to select n records at random from a set of N (where 0 < n < N).

可能的算法是：

遍历列表，并为每个元素选择概率为（所需数量）/（剩余数量）

因此，如果您有40个项目，则第一个被选中的可能性为 5/40 。

So if you had 40 items, the first would have a 5/40 chance of being selected.

如果是，则下一个概率为 4/39 ，否则，概率为 5/39 时，您将拥有5件商品，并且在此之前通常都拥有它们。


If it is, the next has a 4/39 chance, otherwise it has a 5/39 chance. By the time you get to the end
  you will have your 5 items, and often you'll have all of them before that.
假设一个好的伪随机数生成器，这个算法正确吗？
Assuming a good pseudo-random number generator, is this algorithm correct?
 注意 
关于stackoverflow的此类问题很多（很多问题被标记为从C＃中的List 中选择N个随机元素）。

There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).
经常提出上述算法（例如 Kyle Cronin 的答案）和
总是受到质疑（例如，请参阅
 此处，此处，此处，这里 ...）。
The above algorithm is often proposed (e.g. Kyle Cronin's answer) and
it's always questioned (e.g. see
here, here, here, here...).
我可以对这件事有最后的决定吗？
Can I have a final word about the matter?
推荐答案
算法绝对正确。
这不是好的海报的突然发明，而是一种称为选择抽样 / 算法S 的众所周知的技术（由Fan，Muller和Rezucha发现（1）并由Jones（2）在1962年独立提出），在 TAOCP -第2卷-Seminumerical算法-第3.4.2。节
It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.
正如努斯所说：
 
 乍一看似乎是不可靠的，实际上是不正确的。但是仔细分析表明它是完全值得信赖的。

  This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.
算法对 n  N 和 t + 1  st的一组元素中选择>元素的概率为（n-m）/（N-t），当已经选择 m 个元素时。
The algorithm samples n elements from a set of size N and the t + 1st element is chosen with probability (n - m) / (N - t), when already m elements have been chosen.
很容易看出，在选择 n 个项目之前，我们从来没有跑过集的末尾（因为概率为 1 ，当我们有 k 个元素要从其余的 k 个元素中选择时）。
It's easy to see that we never run off the end of the set before choosing n items (as the probability will be 1 when we have k elements to choose from the remaining k elements).
我们也永远不会选择太多元素（ n == m的概率将是 0  ）。
Also we never pick too many elements (the probability will be 0 as soon n == m).
要证明样本是完全无偏的要难一点，但这是
It's a bit harder to demonstrate that the sample is completely unbiased, but it's
 
  ...正确，尽管我们不是选择 t + 1  st概率为 n / N 的项目。 

  ... true in spite of the fact that we are not selecting the t + 1st item with probability n / N. This has caused some confusion in the published literature
（所以不仅仅是在Stackoverflow上）
(so not just on Stackoverflow!)
事实是我们不应该混淆条件概率和无条件概率：
The fact is we should not confuse conditional and unconditional probabilities:
 
 例如，考虑第二个元素；如果在样本中选择了第一个元素（发生概率为 n / N ），则选择了第二个元素的概率为（n-1） /（N-1）;如果未选择第一个元素，则选择第二个元素的概率为 n /（N-1）。

  For example consider the second element; if the first element was selected in the sample (this happen with probability n / N), the second element is selected with probability (n - 1) / (N - 1); if the first element was not selected, the second element is selected with probability n / (N - 1).
选择第二个元素的总概率为（n / N）（（n-1）/（N-1））+（1-n / N）（n /（N-1） ）= n / N 。
The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N.
  ^{TAOCP-第2卷-第3.4节。 2练习3}  
除了理论上的考虑之外， Algorithm S （和 algorithm R  > /  水库采样 ）已在许多知名图书馆中使用（例如 SGI的原始STL实现，  std :: experimental :: sample  ，
   random.sample  在P中ython ...）。
Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample,
random.sample in Python...).
当然，算法S并非总是最佳答案：
Of course algorithm S is not always the best answer:
 
 它是 O（N）（即使我们通常不必跳过所有 N 条记录： n = 2 大约为 2/3 N 时考虑的平均记录数；常规公式在
 TAOCP-第2卷-第3.4.2节-ex 5/6中给出； 
 
 不能使用提前不知道 N 的值。
 
 

it's O(N) (even if we will usually not have to pass over all N records: the average number of records considered when n=2 is about 2/3 N; the general formulas are given in
TAOCP - Vol 2 - § 3.4.2 - ex 5/6);
it cannot be used when the value of N isn't known in advance.

无论如何！
 
  C。 T. Fan，M。E. Muller和I. Rezucha，J。Amer。统计副会长57（1962），第387-402页
 
  T。 G.Jones，CACM 5（1962），第343页
 
 
 
 
 
 
 
  编辑 
 
 您如何随机选择此项，概率为7/22 

  how do you randomly select this item, with a probability of 7/22
 [CUT] 
在极少数情况下，当您需要5个元素时甚至可以选择4个或6个元素
In rare cases, you might even pick 4 or 6 elements when you wanted 5
这来自 N3925 （为避免通用接口/标签分发而作的小修改）：
This is from N3925 (small modifications to avoid the common interface / tag dispatch):
template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
  using dist_t = uniform_int_distribution<Size>;
  using param_t = typename dist_t::param_type;

  dist_t d{};

  Size unsampled_sz = distance(first, last);
  for (n = min(n, unsampled_sz); n != 0;  ++first)
  {
    param_t const p{0, --unsampled_sz};

    if (d(g, p) < n) { *out++ = *first; --n; }
  }

  return out;
}

没有浮点数。
 
 如果需要5个元素，您将获得5个元素； 
 
 如果 uniform_int_distribution  如广告所述，没有偏见。
 
 

If you need 5 elements you get 5 elements;
if uniform_int_distribution "works as advertised" there is no bias.


                        这篇关于从N个集合中随机选择n个记录的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

从N个集合中随机选择n个记录 [英] Select n records at random from a set of N

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

从N个集合中随机选择n个记录 [英] Select n records at random from a set of N

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭