从链表中有效地选择一组随机元素 [英] Efficiently selecting a set of random elements from a linked list

查看:31
本文介绍了从链表中有效地选择一组随机元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有一个长度为 N 的数字链表.N 非常大,我事先不知道 N 的确切值.

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

我怎样才能最有效地编写一个从列表中完全返回k随机数的函数?

How can I most efficiently write a function that will return k completely random numbers from the list?

推荐答案

有一个非常好的和高效的算法,使用一种叫做水库采样的方法.

There's a very nice and efficient algorithm for this using a method called reservoir sampling.

让我先给您介绍一下它的历史:

Let me start by giving you its history:

Knuth 在 p 上调用这个算法 R.他 1997 年版 Seminumerical Algorithms(计算机编程艺术第 2 卷)的第 144 条,并在那里提供了一些代码.Knuth 将该算法归功于 Alan G. Waterman.尽管进行了长时间的搜索,我还是没能找到 Waterman 的原始文档(如果它存在的话),这可能就是为什么您最常看到 Knuth 被引用为该算法的来源.

Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.

McLeod 和 Bellhouse,1983 年 (1) 提供了比 Knuth 更彻底的讨论,以及第一个公布的证明(我知道)该算法有效.

McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.

Vitter 1985 (2) 回顾了算法 R,然后提出了另外三种算法,它们提供相同的输出,但有所不同.他的算法不是选择包含或跳过每个传入元素,而是预先确定要跳过的传入元素的数量.在他的测试中(无可否认,这些测试现在已经过时)通过避免随机数的生成和对每个传入数字的比较,大大减少了执行时间.

Vitter 1985 (2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each in-coming number.

伪代码中的算法是:

Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
  R[j]=I.pop()

elements_seen=s
while I is not empty:
  elements_seen+=1
  j=random(1,elements_seen)       > This is inclusive
  if j<=s:
    R[j]=I.pop()
  else:
    I.pop()

请注意,我专门编写了代码以避免指定输入的大小.这是这个算法的一个很酷的特性:你可以运行它而无需事先知道输入的大小,它仍然确保你遇到的每个元素都有相同的概率结束于 R(即没有偏差).此外,R 包含算法一直考虑的元素的公平且具有代表性的样本.这意味着您可以将其用作在线算法.

Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.

为什么会这样?

McLeod 和 Bellhouse (1983) 提供了使用组合数学的证明.它很漂亮,但在这里重建它会有点困难.因此,我生成了一个更容易解释的替代证明.

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.

我们通过归纳法进行证明.

We proceed via proof by induction.

假设我们想要生成一组 s 元素并且我们已经看到了 n>s 个元素.

Say we want to generate a set of s elements and that we have already seen n>s elements.

假设我们当前的 s 元素已经以 s/n 的概率被选中.

Let's assume that our current s elements have already each been chosen with probability s/n.

根据算法的定义,我们以s/(n+1)的概率选择元素n+1.

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

每个已经成为结果集一部分的元素都有1/s被替换的概率.

Each element already part of our result set has a probability 1/s of being replaced.

n-seen 结果集中的一个元素在 n+1-seen 结果集中被替换的概率是 (1/s)*s/(n+1)=1/(n+1).反之,一个元素未被替换的概率为1-1/(n+1)=n/(n+1).

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

因此,n+1-seen 结果集包含一个元素,如果它是 n-seen 结果集的一部分并且没有被替换---this概率是 (s/n)*n/(n+1)=s/(n+1)---或者如果元素被选择---概率 s/(n+1).

Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).

算法的定义告诉我们,第一个 s 元素被自动包含为结果集的第一个 n=s 成员.因此,n-seen 结果集包含具有 s/n (=1) 概率的每个元素,为我们提供了必要的归纳基础情况.

The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n-seen result set includes each element with s/n (=1) probability giving us the necessary base case for the induction.

参考资料

  1. 麦克劳德、A. 伊恩和大卫 R. 贝尔豪斯.一种用于绘制简单随机样本的便捷算法."皇家统计学会杂志.系列 C(应用统计)32.2(1983):182-184.(链接)

Vitter、Jeffrey S.水库随机采样."ACM 数学软件交易 (TOMS) 11.1 (1985):37-57.(链接)

Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57. (Link)

这篇关于从链表中有效地选择一组随机元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆