从N个集合中随机选择n个记录 [英] Select n records at random from a set of N

查看:171
本文介绍了从N个集合中随机选择n个记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要从一组 N (其中个)中随机选择 n 个记录0< n< N )。

I need to select n records at random from a set of N (where 0 < n < N).

可能的算法是:


遍历列表,并为每个元素选择概率为(所需数量)/(剩余数量)

因此,如果您有40个项目,则第一个被选中的可能性为 5/40

So if you had 40 items, the first would have a 5/40 chance of being selected.

如果是,则下一个概率为 4/39 ,否则,概率为 5/39 时,您将拥有5件商品,并且在此之前通常都拥有它们。

If it is, the next has a 4/39 chance, otherwise it has a 5/39 chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.

假设一个好的伪随机数生成器,这个算法正确吗?

Assuming a good pseudo-random number generator, is this algorithm correct?

注意

关于stackoverflow的此类问题很多(很多问题被标记为从C#中的List 中选择N个随机元素)。

There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).

经常提出上述算法(例如 Kyle Cronin 的答案)和
总是受到质疑(例如,请参阅
此处此处此处这里 ...)。

The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).

我可以对这件事有最后的决定吗?

Can I have a final word about the matter?

推荐答案

算法绝对正确。

这不是好的海报的突然发明,而是一种称为选择抽样 / 算法S 的众所周知的技术(由Fan,Muller和Rezucha发现(1)并由Jones(2)在1962年独立提出),在 TAOCP -第2卷-Seminumerical算法-第3.4.2。节

It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.

正如努斯所说:


乍一看似乎是不可靠的,实际上是不正确的。但是仔细分析表明它是完全值得信赖的。

This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.

算法对 n N t + 1 st的一组元素中选择>元素的概率为(n-m)/(N-t),当已经选择 m 个元素时。

The algorithm samples n elements from a set of size N and the t + 1st element is chosen with probability (n - m) / (N - t), when already m elements have been chosen.

很容易看出,在选择 n 个项目之前,我们从来没有跑过集的末尾(因为概率为 1 ,当我们有 k 个元素要从其余的 k 个元素中选择时)。

It's easy to see that we never run off the end of the set before choosing n items (as the probability will be 1 when we have k elements to choose from the remaining k elements).

我们也永远不会选择太多元素( n == m的概率将是 0 )。

Also we never pick too many elements (the probability will be 0 as soon n == m).

要证明样本是完全无偏的要难一点,但这是

It's a bit harder to demonstrate that the sample is completely unbiased, but it's


...正确,尽管我们不是选择 t + 1 st概率为 n / N 的项目。

... true in spite of the fact that we are not selecting the t + 1st item with probability n / N. This has caused some confusion in the published literature

(所以不仅仅是在Stackoverflow上)

(so not just on Stackoverflow!)

事实是我们不应该混淆条件概率和无条件概率:

The fact is we should not confuse conditional and unconditional probabilities:


例如,考虑第二个元素;如果在样本中选择了第一个元素(发生概率为 n / N ),则选择了第二个元素的概率为(n-1) /(N-1);如果未选择第一个元素,则选择第二个元素的概率为 n /(N-1)

For example consider the second element; if the first element was selected in the sample (this happen with probability n / N), the second element is selected with probability (n - 1) / (N - 1); if the first element was not selected, the second element is selected with probability n / (N - 1).

选择第二个元素的总概率为(n / N)((n-1)/(N-1))+(1-n / N)(n /(N-1) )= n / N

The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N.

TAOCP-第2卷-第3.4节。 2练习3

除了理论上的考虑之外, Algorithm S (和 algorithm R > / 水库采样 )已在许多知名图书馆中使用(例如 SGI的原始STL实现 std :: experimental :: sample
random.sample 在P中ython ...)。

Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample, random.sample in Python...).

当然,算法S并非总是最佳答案:

Of course algorithm S is not always the best answer:


  • 它是 O(N)(即使我们通常不必跳过所有 N 条记录: n = 2 大约为 2/3 N 时考虑的平均记录数;常规公式在
    TAOCP-第2卷-第3.4.2节-ex 5/6中给出;

  • 不能使用提前不知道 N 的值

  • it's O(N) (even if we will usually not have to pass over all N records: the average number of records considered when n=2 is about 2/3 N; the general formulas are given in TAOCP - Vol 2 - § 3.4.2 - ex 5/6);
  • it cannot be used when the value of N isn't known in advance.

无论如何!


  1. C。 T. Fan,M。E. Muller和I. Rezucha,J。Amer。统计副会长57(1962),第387-402页

  2. T。 G.Jones,CACM 5(1962),第343页






编辑


您如何随机选择此项,概率为7/22

how do you randomly select this item, with a probability of 7/22

[CUT]

在极少数情况下,当您需要5个元素时甚至可以选择4个或6个元素

In rare cases, you might even pick 4 or 6 elements when you wanted 5

这来自 N3925 (为避免通用接口/标签分发而作的小修改):

This is from N3925 (small modifications to avoid the common interface / tag dispatch):

template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
  using dist_t = uniform_int_distribution<Size>;
  using param_t = typename dist_t::param_type;

  dist_t d{};

  Size unsampled_sz = distance(first, last);
  for (n = min(n, unsampled_sz); n != 0;  ++first)
  {
    param_t const p{0, --unsampled_sz};

    if (d(g, p) < n) { *out++ = *first; --n; }
  }

  return out;
}

没有浮点数。


  • 如果需要5个元素,您将获得5个元素;

  • 如果 uniform_int_distribution 如广告所述,没有偏见。

  • If you need 5 elements you get 5 elements;
  • if uniform_int_distribution "works as advertised" there is no bias.

这篇关于从N个集合中随机选择n个记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆