How can I shuffle a very large list stored in a file in Python?


Problem description


I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1.

This would be the naive (and totally nonfunctional) way of doing it, just so it's clear what I'm wanting.

import random
numbers = range(2**32)
random.seed(0)
random.shuffle(numbers)

I've tried making the list with numpy.arange() and using pycrypto's random.shuffle() to shuffle it. Making the list ate up about 8 GB of RAM, then shuffling raised that to around 25 GB. I only have 32 GB to give. But that doesn't matter because...

I've tried cutting the list into 1024 slices and trying the above, but even one of these slices takes way too long. I cut one of these slices into 128 yet smaller slices, and that took about 620ms each. If it grew linearly, then that means the whole thing would take about 22 and a half hours to complete. That sounds alright, but it doesn't grow linearly.
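As a quick check of that estimate, the arithmetic can be written out as below. The variable names are only illustrative; the figures are the ones quoted above, and the result assumes perfectly linear scaling, which the measurements contradict.

```python
# Linear extrapolation from the timing quoted above (illustrative names).
slices = 1024                  # top-level slices of the full list
sub_slices = 128               # smaller slices per top-level slice
seconds_per_sub_slice = 0.620  # measured time for one small slice

total_seconds = slices * sub_slices * seconds_per_sub_slice
total_hours = total_seconds / 3600
print(f"{total_hours:.2f} hours")
```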

Another thing I've tried is generating random numbers for every entry and using those as indices for their new location. I then go down the list and attempt to place the number at the new index. If that index is already in use, the index is incremented until it finds a free one. This works in theory, and it can do about half of it, but near the end it keeps having to search for new spots, wrapping around the list several times.
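A minimal sketch of that placement scheme at a small scale may make the slowdown easier to see. The function name `probe_place` and the probe counter are hypothetical, and the standard library's seeded `random.Random` stands in for pycrypto so that the run is repeatable.

```python
import random

def probe_place(n, seed=0):
    """Place each value at a random index, stepping forward on collisions."""
    rng = random.Random(seed)      # seeded, so the result is repeatable
    slots = [None] * n
    probes = 0                     # extra steps spent looking for a free slot
    for value in range(n):
        idx = rng.randrange(n)
        while slots[idx] is not None:
            idx = (idx + 1) % n    # wrap around the end of the list
            probes += 1
        slots[idx] = value
    return slots, probes

slots, probes = probe_place(1000)
print(sorted(slots) == list(range(1000)), probes)
```

Every value does end up in exactly one slot, but the number of probes climbs steeply as the list fills, which matches the behaviour described above. Note also that the resulting order is not a uniformly random permutation, because occupied runs of slots make the positions just after them more likely.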

Is there any way to pull this off? Is this a feasible goal at all?

Solution

Computing all the values up front seems impractical: Crypto takes about a millisecond to compute one random integer, so 2**32 of them would take days (roughly 50 at that rate).

Here is an implementation of the Knuth (Fisher-Yates) shuffle as a generator:

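from Crypto.Random.random import randint
import numpy as np

def onthefly(n):
    # Fisher-Yates (Knuth) shuffle that yields each element as soon as it is fixed.
    numbers = np.arange(n, dtype=np.uint32)
    for i in range(n):
        j = randint(i, n - 1)  # both bounds are inclusive
        numbers[i], numbers[j] = numbers[j], numbers[i]
        yield numbers[i]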

For n=10:

gen = onthefly(10)
print([next(gen) for i in range(9)])
print(next(gen))
# Sample output (differs on each run):
# [9, 0, 2, 6, 4, 8, 7, 3, 1]
# 5

For n=2**32, the generator takes about a minute to initialize (the array of 2**32 uint32 values occupies 16 GiB), but each call after that is O(1).
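The question asks for a deterministic result, and as far as I know Crypto's randint cannot be seeded. The following is a minimal sketch of the same generator, assuming a non-cryptographic seeded source is acceptable. It swaps in NumPy's `np.random.default_rng`, and the name `onthefly_seeded` and its `seed` parameter are hypothetical.

```python
import numpy as np

def onthefly_seeded(n, seed=0):
    """Fisher-Yates shuffle as a generator, repeatable for a given seed."""
    rng = np.random.default_rng(seed)    # seeded, NOT cryptographically secure
    numbers = np.arange(n, dtype=np.uint32)
    for i in range(n):
        j = int(rng.integers(i, n))      # upper bound is exclusive here
        numbers[i], numbers[j] = numbers[j], numbers[i]
        yield int(numbers[i])

first = list(onthefly_seeded(10, seed=0))
second = list(onthefly_seeded(10, seed=0))
print(first == second, sorted(first) == list(range(10)))
```

The same seed gives the same sequence on every run, at the cost of the cryptographic quality that pycrypto provides. The memory needed for the array is unchanged.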
