有效地从链表中选择一组随机元素的 [英] Efficiently selecting a set of random elements from a linked list

查看:172
本文介绍了有效地从链表中选择一组随机元素的的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说我有长度为N N个数的链接列表是非常大的,我事先不知道N的精确值做

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

我怎样才能最有效地编写一个函数,将复位K完全随机的号码从名单?

How can I most efficiently write a function that will return k completely random numbers from the list?

推荐答案

有一个非常漂亮的,高效的算法,这种使用方法叫水库取样

There's a very nice and efficient algorithm for this using a method called reservoir sampling.

让我给你的记录启动:

克努特调用这个算法R于页。 144他的1997年版半数值算法(计算机程序设计艺术第2卷),并提供了一​​些$ C $下它。克努特属性的算法,艾伦G.沃特曼。尽管长时间的寻找,我一​​直没能找到沃特曼的原文件,如果存在的话,这可能是为什么你会经常看到克努特引述该算法的来源。

Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.

麦克劳德和Bellhouse,1983 (1)提供比克努特一个更深入的讨论,以及首次公布的证据(即我所知道的),该算法的作品。

McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.

维特1985年(2)审查算法R和再presents另外三个算法,它提供了相同的输出,但与一捻。而不是作出选择,以包括或跳过每个传入元件,他的算法predetermines要跳过的传入元件的数目。在他的测试中(其中,无可否认,是过时了),这个下降的执行时间大大避免在每个-未来数随机数生成和比较。

Vitter 1985 (2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each in-coming number.

伪code 的算法是:

Let R by the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
  R[j]=I.pop()

elements_seen=s
while I is not empty:
  elements_seen+=1
  j=random(1,elements_seen)       > This is inclusive
  if j<=s:
    R[j]=I.pop()
  else:
    I.pop()

请注意,我已经专门写了code,避免指定输入的大小。这就是该算法的清凉属性之一:你可以运行它,而不需要知道输入的大小事先并的还是的向你保证,你遇到的每个元素在结束了一个相同的概率研究(即,没有偏见)。此外,研究包含算法已考虑在所有时间要素的公平和再presentative样本。这意味着你可以使用它作为一个在线算法

Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.

为什么这项工作?

麦克劳德和Bellhouse(1983)提供了使用的组合数学证明。这是pretty的,但是这将是一个有点困难,在这里重建。因此,我产生了另一个证明是更容易解释。

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.

我们通过继续通过归纳证明。

We proceed via proof by induction.

说,我们要生成一组取值元素,而我们已经看到 N'GT; S 元素。

Say we want to generate a set of s elements and that we have already seen n>s elements.

让我们假设,我们目前的取值元素都已经各自被选择的概率 S / N

Let's assume that our current s elements have already each been chosen with probability s/n.

通过算法的定义,我们选择的元素 N + 1 的概率 S /(N + 1)

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

每个元素已经是我们的结果集的一部分,有一个概率 1 / S 被取代的。

Each element already part of our result set has a probability 1/s of being replaced.

的概率从 N -seen结果集被替换的 N + 1 -seen元素因此,结果集是(1 /秒)* S /(N + 1)= 1 /(N + 1)。相反,一个元素不会被替换的概率 1-1 /(N + 1)= N /(N + 1)

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

因此​​, N + 1 -seen结果集包含的元素或者如果它是一部分的 N -seen结果集,并没有取代---这个概率是(S / N)* N /(N + 1)= S /(N + 1) - -or如果元素被选择---概率 S /(N + 1)

Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).

该算法的定义告诉我们,第一个取值元素将自动包括作为第一个ň= S 结果集的成员。因此,正看到结果集包括与每个元素的S / N (= 1)的概率给予我们必要的基本情况进行了归纳。

The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n-seen result set includes each element with s/n (=1) probability giving us the necessary base case for the induction.

引用

  1. 麦克劳德,A伊恩和David R. Bellhouse。 一个方便的算法,绘制简单随机样本。杂志皇家统计学会的。 C系列(应用统计),32.2(1983):182-184。 (链接

维特,杰弗里S.与水库随机抽样。 ACM交易在数学软件(TOMS)11.1(1985):37-57。 (<一href="http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/papers/RandomSampling/1985-Vitter-Random-sampling-with-reservior.pdf">Link)

Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57. (Link)

这篇关于有效地从链表中选择一组随机元素的的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆