从未知长度的序列中随机挑选 N 个不同的项目,仅在一次迭代中 [英] Pick N distinct items at random from sequence of unknown length, in only one iteration

查看:22
本文介绍了从未知长度的序列中随机挑选 N 个不同的项目,仅在一次迭代中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试编写一种算法,该算法可以从一个序列中随机挑选 N 个不同的项目,而无需事先知道序列的大小,以及多次迭代该序列的代价高昂的地方.例如,序列的元素可能是一个大文件的行.

I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.

当 N=1 时,我找到了一个解决方案(即从一个巨大的序列中随机选择一个元素"):

I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):

import random
items = range(1, 10) # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1

但是对于 N 的其他值(例如,N=3),我如何才能实现相同的目标?

But how can I achieve the same thing for other values of N (say, N=3)?

推荐答案

使用水库采样.这是一个非常简单的算法,适用于任何 N.

Use reservoir sampling. It's a very simple algorithm that works for any N.

这里一个 Python 实现,这里是另一个.

Here is one Python implementation, and here is another.

这篇关于从未知长度的序列中随机挑选 N 个不同的项目,仅在一次迭代中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆