Efficiently yield elements from large list in (pseudo) random order


Problem Description

I am experimenting with unrolling a few nested loops for (potentially) better performance at the expense of memory. In my scenario, I would end up with a list of about 300M elements (tuples), which I'd have to yield in (more or less) random order.

At this order of magnitude, random.shuffle(some_list) really is not the way to go anymore.

The example below illustrates the issue. Be aware that, on x86_64 Linux with CPython 3.6.4, it will eat about 11 GB of memory.

def get_random_element():
    some_long_list = list(range(0, 300000000))
    for random_item in some_long_list:
        yield random_item
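
The ~11 GB figure is plausible from first principles: in CPython, each list slot holds an 8-byte pointer, and each (non-cached) small int object costs about 28 bytes on a 64-bit build. A rough back-of-envelope sketch (my addition; exact numbers vary by platform and CPython version):

```python
import sys

# Estimate the per-element memory cost of a Python list of distinct ints:
# list slot (pointer) plus the int object each slot points to.
n = 1000
lst = list(range(1000, 1000 + n))  # values above 256, so no cached singletons
per_element = (sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)) / n
print(f"~{per_element:.0f} bytes per element")
```

At roughly 36 bytes per element, 300M elements extrapolates to about 10.8 GB, consistent with the observation above.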

My thinking so far is to simply generate one random index per iteration and yield randomly picked elements (indefinitely) from the list. It may yield certain elements multiple times and totally skip others, which would be a trade-off worth considering.
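
That idea, sampling with replacement, can be sketched in a few lines (a minimal sketch, not part of the question's code; the function name is my own):

```python
import random

def sample_with_replacement(data):
    # Yield randomly picked elements from data indefinitely.
    # Trade-off: some elements may repeat, others may never appear.
    n = len(data)
    while True:
        yield data[random.randrange(n)]
```

Memory stays O(1) beyond the list itself, but there is no guarantee of full coverage.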

What other options do I have within reasonable bounds of memory and CPU time to possibly yield every element of the list only once?

Recommended Answer

Here is Fisher-Yates-Knuth in-place sampling (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)

Memory was stable at ~4 GB (yes, I was using a length of 100000000).

# Fisher-Yates-Knuth sampling, in-place Durstenfeld version

import numpy as np

def swap(data, posA, posB):
    # exchange two elements of the list in place
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    pos = datalen

    while pos > 0:
        idx = np.random.randint(low=0, high=pos) # sample in the [0...pos) range

        pos -= 1
        swap(data, idx, pos) # move the sampled element into the settled tail

        yield data[pos]


length = 100000000
some_long_list = list(range(0, length))

gen = get_random_element(some_long_list, length)

for k in range(0, length):
    print(next(gen))
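
With a smaller length, it is easy to verify that the generator produces a true permutation, i.e. every element exactly once (a sanity-check sketch repeating the answer's generator):

```python
import numpy as np

def swap(data, posA, posB):
    # exchange two elements of the list in place
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in [0, pos)
        pos -= 1
        swap(data, idx, pos)
        yield data[pos]

length = 1000
seen = list(get_random_element(list(range(length)), length))
assert sorted(seen) == list(range(length))  # a permutation: no repeats, no gaps
```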

Update

For speed, you might want to inline swap() as well.
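
One possible inlined version (a sketch under that suggestion; the function name is my own, and the logic is unchanged, only the per-iteration function call is removed):

```python
import numpy as np

def get_random_element_inlined(data, datalen):
    # Same Durstenfeld shuffle, with swap() inlined to avoid
    # one Python function call per yielded element.
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in [0, pos)
        pos -= 1
        if idx != pos:
            data[pos], data[idx] = data[idx], data[pos]
        yield data[pos]
```

Since the swap is only three bytecode-level operations, inlining it removes most of the per-element call overhead in a 100M-iteration loop.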

