Create large random boolean matrix with numpy
Question
I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:
import numpy as np

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
But sadly it does not seem to terminate for an N this big. So I tried to split it up and generate the matrix one row at a time:
N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if i % 100 == 0:
        print(i)
Now something strange happens (at least on my device): the first ~1100 rows are generated very quickly, but after that the code becomes horribly slow. Why is this happening? What am I missing here? Are there better ways to create a big matrix which has True entries with probability p and False entries with probability 1-p?
Edit: Since many of you assumed that RAM would be the problem: the device that will run the code has almost 500GB of RAM, so this won't be a problem.
Recommended answer
The problem is your RAM: the values are held in memory while the matrix is being created. I just created this matrix using this command:
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
I used an AWS i3 instance with 64GB of RAM and 8 cores. While this matrix is being created, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:
%time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s

def mask_method(N, p):
    # Fills the preallocated global array `mask` from the question, one row at a time.
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if i % 100 == 0:
            print(i)

%time mask_method(N, p)
CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s
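To put the memory figures in this answer in context, here is a small sketch that computes how much space a dense N x N array needs for a few dtypes, without allocating anything. The helper name nbytes is made up for illustration. One relevant detail: np.empty((N, N)) defaults to float64, so the mask array from the question stores 8 bytes per entry rather than 1. The higher peak of the one-shot np.random.choice call is plausibly due to large temporary arrays created internally, but that is an assumption about NumPy's implementation, not something measured here.

```python
import numpy as np

N = 30000

def nbytes(shape, dtype):
    """Bytes needed for a dense array of this shape and dtype (nothing is allocated)."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

for name, dtype in [("bool", np.bool_), ("int64", np.int64), ("float64", np.float64)]:
    print(f"{name:8s} {nbytes((N, N), dtype) / 1e9:.1f} GB")
# bool     0.9 GB
# int64    7.2 GB
# float64  7.2 GB
```

So the final boolean result itself needs under 1GB. Most of the memory observed above goes to wider dtypes and temporaries.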
Note that the mask method only takes up ~9GB of RAM at its peak.
The first method releases its RAM once the call is done, whereas the function method retains all of it, because the result stays in mask.
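As a possible alternative (a sketch, not the method benchmarked above), the matrix can be built by comparing uniform random numbers against p and writing the result into a preallocated bool array in row chunks. The function name random_bool_matrix and the chunk_rows and seed parameters are assumptions introduced for this example. Note also that in the snippets above, p=[p, 1-p] paired with a=[False, True] gives False probability p and True probability 1-p, which is the reverse of what the question asks for. The sketch below makes entries True with probability p.

```python
import numpy as np

def random_bool_matrix(n, p, chunk_rows=1000, seed=None):
    """Return an (n, n) bool array whose entries are True with probability p."""
    rng = np.random.default_rng(seed)
    out = np.empty((n, n), dtype=bool)  # 1 byte per entry
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Uniform floats in [0, 1); "< p" is True with probability p.
        # Only one chunk of float64 values exists at a time.
        out[start:stop] = rng.random((stop - start, n)) < p
    return out

# Small demonstration; use n=30000 for the size discussed in the question.
mask = random_bool_matrix(2000, 0.1, chunk_rows=256, seed=0)
print(mask.dtype, mask.shape, mask.mean())
```

With chunk_rows=1000 and n=30000, each temporary float64 chunk is about 240MB, so peak memory should stay close to the roughly 0.9GB needed for the boolean result itself.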