How to merge two large numpy arrays if slicing doesn't resolve memory error?
Question
I have two numpy arrays, container1 and container2, where container1.shape = (900, 4000) and container2.shape = (5000, 4000). Merging them using vstack results in a MemoryError. After searching through the old questions posted here, I tried to merge them using slicing, like this:
mergedContainer = numpy.vstack((container1, container2[:1000]))
mergedContainer = numpy.vstack((mergedContainer, container2[1000:2500]))
mergedContainer = numpy.vstack((mergedContainer, container2[2500:3000]))
But after this, even if I do:
mergedContainer = numpy.vstack((mergedContainer, container2[3000:3100]))
it raises a MemoryError.

I am using Python 3.4.3 (32-Bit) and would like to resolve this without shifting to 64-Bit.
Recommended answer
Every time you call np.vstack, NumPy has to allocate space for a brand new array. So if we say 1 row requires 1 unit of memory, np.vstack([container1, container2]) requires an additional 900+5000 units of memory. Moreover, before the assignment occurs, Python needs to hold space for the old mergedContainer (if it exists) as well as space for the new mergedContainer. So building mergedContainer iteratively with slices actually requires more memory than trying to build it with a single call to np.vstack.
Building it iteratively:
| total | mergedContainer | container1 | container2 | temp | statement |
|-------+-----------------+------------+------------+------+-----------------------------------------------------------------------|
| 7800 | 1900 | 900 | 5000 | 0 | mergedContainer = np.vstack((container1, container2[:1000])) |
| 11200 | 3400 | 900 | 5000 | 1900 | mergedContainer = np.vstack((mergedContainer, container2[1000:2500])) |
| 13200 | 3900 | 900 | 5000 | 3400 | mergedContainer = np.vstack((mergedContainer, container2[2500:3000])) |
Building it from a single call to np.vstack:
| total | mergedContainer | container1 | container2 | temp | statement |
|-------+-----------------+------------+------------+------+-------------------------------------------------------|
| 11800 | 5900 | 900 | 5000 | 0 | mergedContainer = np.vstack((container1, container2)) |
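The row-unit bookkeeping in the two tables can be reproduced with a few lines of arithmetic. This is only a sketch of the accounting model used in the answer (1 row = 1 unit, and the old result stays alive until the name is rebound); the helper peak_units is a hypothetical name introduced here, not part of NumPy, and real peak usage also depends on dtype and allocator behaviour.

```python
def peak_units(new_rows_per_call, resident_rows):
    """Peak memory in row units for a sequence of vstack calls.

    new_rows_per_call: rows added to the result by each successive call.
    resident_rows: rows held by arrays that stay alive the whole time
    (here container1 and container2).
    """
    merged = 0                      # rows in the current mergedContainer
    peak = resident_rows
    for rows in new_rows_per_call:
        old = merged                # old result is still held during the call
        merged = old + rows         # vstack allocates a brand new array
        peak = max(peak, resident_rows + old + merged)
    return peak


RESIDENT = 900 + 5000               # container1 + container2

# The three calls from the first table (the result holds 3900 rows afterwards).
iterative = peak_units([1900, 1500, 500], RESIDENT)

# The same loop carried on until all 5900 rows are merged.
iterative_full = peak_units([1900, 1500, 500, 2000], RESIDENT)

# A single call to np.vstack, as in the second table.
single = peak_units([5900], RESIDENT)

print(iterative, iterative_full, single)
```

Under this model the iterative approach already peaks above the single call after three steps, and finishing the merge one slice at a time would peak higher still.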
We can do even better, however. Instead of calling np.vstack repeatedly, allocate all the space that is needed once from the very beginning and write the contents of both container1 and container2 into it. In other words, avoid allocating two disparate arrays container1 and container2 if you know that you eventually want to merge them.
container = np.empty((5900, 4000))
Note that basic slices such as container[:900] always return views, and views require essentially no additional memory. So you could define container1 and container2 like this:
container1 = container[:900]
container2 = container[900:]
and then assign values in place. This modifies container:
container1[:] = ...
container2[:] = ...
Thus your memory requirement would stay around 5900 units.
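One way to convince yourself that the two slices really are views onto the single preallocated buffer is to check with np.shares_memory. The sketch below uses scaled-down shapes (rows1, rows2 and cols are illustrative stand-ins for 900, 5000 and 4000) and constant fill values in place of the real data.

```python
import numpy as np

# Scaled-down, illustrative sizes; the question uses 900, 5000 and 4000.
rows1, rows2, cols = 9, 50, 40

container = np.empty((rows1 + rows2, cols))   # the only large allocation
container1 = container[:rows1]                # basic slice: a view, no copy
container2 = container[rows1:]                # basic slice: a view, no copy

# Both views point into container's buffer.
shares1 = np.shares_memory(container1, container)
shares2 = np.shares_memory(container2, container)
base_is_container = container1.base is container

# Fill in place; the constants stand in for the real data.
container1[:] = 1.0
container2[:] = 2.0

print(shares1, shares2, base_is_container)
print(container[:rows1].min(), container[rows1:].max())
```

Writing through either view changes container directly, so no merge step is needed afterwards.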
For example,
import numpy as np
np.random.seed(2015)
container = np.empty((5, 4), dtype='int')
container1 = container[:2]
container2 = container[2:]
container1[:] = np.random.randint(10, size=(2,4))
container2[:] = np.random.randint(1000, size=(3,4))
print(container)
yields
[[  2   2   9   6]
 [  8   5   7   8]
 [112  70 487 124]
 [859   8 275 936]
 [317 134 393 909]]
while only requiring space for one array of shape (5, 4), plus temporarily used space for the random arrays.
Thus, you wouldn't have to change very much in your code to save memory. Just set it up with
container = np.empty((5900, 4000))
container1 = container[:900]
container2 = container[900:]
and then use
container1[:] = ...
instead of
container1 = ...
to assign values in place. (Or, of course, you could just write directly into container.)
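The difference between the two assignment forms is easy to miss, so here is a small sketch of it. The array sizes and fill values are arbitrary and chosen only for illustration.

```python
import numpy as np

container = np.zeros((5, 4), dtype=int)
container1 = container[:2]

# In-place assignment writes through the view into container.
container1[:] = 7
in_place_sum = container.sum()          # 2 rows * 4 columns * 7

# Plain assignment rebinds the name to a separate, newly allocated array;
# container is left untouched and the view is lost.
container1 = np.full((2, 4), 9)
still_shared = np.shares_memory(container1, container)
after_rebind_sum = container.sum()

print(in_place_sum, still_shared, after_rebind_sum)
```

After the plain assignment, container1 no longer refers to the first rows of container, so later writes to it would not end up in the merged array.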