Efficiently yield elements from large list in (pseudo) random order
Problem description
I am experimenting with unrolling a few nested loops for (potentially) better performance at the expense of memory. In my scenario, I would end up with a list of about 300M elements (tuples), which I'd have to yield in (more or less) random order.
At this order of magnitude, random.shuffle(some_list) really is not the way to go anymore.
The example below illustrates the issue. Be aware that on x86_64 Linux with CPython 3.6.4, it will eat about 11 GB of memory.
def get_random_element():
    some_long_list = list(range(0, 300000000))
    for random_item in some_long_list:
        yield random_item
My thinking so far is to simply generate one random index per iteration and yield randomly picked elements (indefinitely) from the list. It may yield certain elements multiple times and totally skip others, which would be a trade-off worth considering.
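That idea can be sketched as follows (a minimal illustration of the trade-off described above, not part of the original question; the function name is hypothetical):

```python
import random

def yield_random_picks(data):
    """Yield uniformly random elements from data, indefinitely.

    Memory cost beyond the list itself is O(1), but elements may
    repeat and some may never be yielded at all.
    """
    n = len(data)
    while True:
        yield data[random.randrange(n)]
```

Every yielded value is guaranteed to come from the list, but there is no guarantee of full coverage.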
What other options do I have, within reasonable bounds of memory and CPU time, to possibly yield every element of the list only once?
Recommended answer
Here is Fisher-Yates-Knuth in-place sampling (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).

Memory was stable at ~4 GB (yes, I was using 100,000,000 elements).
# Fisher-Yates-Knuth sampling, in-place Durstenfeld version
import numpy as np

def swap(data, posA, posB):
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in the [0...pos) range
        pos -= 1
        swap(data, idx, pos)
        yield data[pos]

length = 100000000
some_long_list = list(range(0, length))
gen = get_random_element(some_long_list, length)
for k in range(0, length):
    print(next(gen))
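On a small list, it is easy to check that this generator yields every element exactly once, i.e. the output is a permutation of the input. The sketch below is self-contained and swaps NumPy for the stdlib random module (an assumption for portability; the algorithm is unchanged):

```python
import random

def swap(data, posA, posB):
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    # In-place Durstenfeld version of Fisher-Yates shuffling,
    # yielding one shuffled element per iteration.
    pos = datalen
    while pos > 0:
        idx = random.randrange(pos)  # sample in the [0...pos) range
        pos -= 1
        swap(data, idx, pos)
        yield data[pos]

length = 1000
some_list = list(range(length))
drawn = list(get_random_element(some_list, length))
# Each element appears exactly once: the result is a permutation.
assert sorted(drawn) == list(range(length))
```

Because the shuffle happens in place, no second copy of the list is ever materialized, which is where the memory savings over random.shuffle plus a separate result list come from.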
Update
For speed, you might want to inline swap() as well.
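An inlined variant might look like the sketch below (the swap is written directly as a tuple assignment in the loop body, saving a function call per element; the function name is hypothetical):

```python
import random

def get_random_element_inlined(data, datalen):
    # Same Durstenfeld shuffle as above, with swap() inlined.
    pos = datalen
    while pos > 0:
        idx = random.randrange(pos)  # sample in the [0...pos) range
        pos -= 1
        data[idx], data[pos] = data[pos], data[idx]  # inlined swap
        yield data[pos]

length = 1000
data = list(range(length))
assert sorted(get_random_element_inlined(data, length)) == list(range(length))
```

In CPython, avoiding a function call per element can matter at the 100M+ scale discussed here, since call overhead dominates such tight loops.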