How to merge two large numpy arrays if slicing doesn't resolve memory error?


Question

I have two numpy arrays, container1 and container2, where container1.shape = (900, 4000) and container2.shape = (5000, 4000). Merging them with vstack raises a MemoryError. After searching through the old questions posted here, I tried to merge them using slicing, like this:

mergedContainer = numpy.vstack((container1, container2[:1000]))
mergedContainer = numpy.vstack((mergedContainer, container2[1000:2500]))
mergedContainer = numpy.vstack((mergedContainer, container2[2500:3000]))

But after this, even if I do:

mergedContainer = numpy.vstack((mergedContainer, container2[3000:3100]))

it raises a MemoryError.

I am using Python 3.4.3 (32-bit) and would like to resolve this without moving to 64-bit.

Recommended answer

Every time you call np.vstack, NumPy has to allocate space for a brand-new array. So if we say that one row requires one unit of memory, then

np.vstack([container1, container2])

requires an additional 900 + 5000 units of memory. Moreover, before the assignment occurs, Python has to hold the old mergedContainer (if it exists) as well as the new mergedContainer. So building mergedContainer iteratively from slices actually requires more memory than building it with a single call to np.vstack.

Building it iteratively:

| total | mergedContainer | container1 | container2 | temp | statement                                                             |
|-------+-----------------+------------+------------+------+-----------------------------------------------------------------------|
|  7800 |            1900 |        900 |       5000 |    0 | mergedContainer = np.vstack((container1, container2[:1000]))          |
| 11200 |            3400 |        900 |       5000 | 1900 | mergedContainer = np.vstack((mergedContainer, container2[1000:2500])) |
| 13200 |            3900 |        900 |       5000 | 3400 | mergedContainer = np.vstack((mergedContainer, container2[2500:3000])) |

Building it with a single call to np.vstack:

| total | mergedContainer | container1 | container2 | temp | statement                                             |
|-------+-----------------+------------+------------+------+-------------------------------------------------------|
| 11800 |            5900 |        900 |       5000 |    0 | mergedContainer = np.vstack((container1, container2)) |
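
The bookkeeping in the two tables above can be reproduced with a little arithmetic. The following is only a sketch of that model: it counts rows as memory units, as the answer does, and the helper name peak_units is made up for illustration. The byte estimate at the end assumes float64 data, which the question does not state.

```python
ROWS_1, ROWS_2 = 900, 5000  # rows in container1 and container2 (from the question)

def peak_units(chunk_ends):
    """Peak units held while stacking container2 chunk by chunk.

    Hypothetical helper: chunk_ends lists the end row of each slice of
    container2. It models the fact that the old mergedContainer stays alive
    until the new one has been built and assigned.
    """
    merged = 0  # rows in the current mergedContainer (0 = not built yet)
    start = 0
    peak = 0
    for end in chunk_ends:
        old = merged                    # the "temp" column in the tables
        base = old if old else ROWS_1   # the first call stacks onto container1
        merged = base + (end - start)
        peak = max(peak, merged + old + ROWS_1 + ROWS_2)
        start = end
    return peak

iterative = peak_units([1000, 2500, 3000])  # the three calls in the first table
single = peak_units([5000])                 # the one call in the second table
print(iterative, single)                    # 13200 11800

# Assumption: float64 data, so one row of 4000 columns takes 4000 * 8 bytes.
BYTES_PER_ROW = 4000 * 8
print(single * BYTES_PER_ROW / 1e6, "MB")   # 377.6 MB
```

Under the same model, finishing the iterative merge with one more call for rows 3000 to 5000 would peak at 15700 units, well above the 11800 units of the single call.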


We can do even better, however. Instead of calling np.vstack repeatedly, allocate all the space that is needed once, at the very beginning, and write the contents of both container1 and container2 into it. In other words, avoid allocating two separate arrays container1 and container2 if you know that you will eventually want to merge them.

container = np.empty((5900, 4000))

Note that basic slices such as container[:900] always return views, and views require essentially no additional memory. So you could define container1 and container2 like this:

container1 = container[:900]   
container2 = container[900:]   

and then assign values in place. This modifies container:

container1[:] = ...              
container2[:] = ...

Thus your memory requirement would stay at around 5900 units.

For example,

import numpy as np
np.random.seed(2015)

container = np.empty((5, 4), dtype='int')
container1 = container[:2]   
container2 = container[2:]   
container1[:] = np.random.randint(10, size=(2,4))
container2[:] = np.random.randint(1000, size=(3,4))
print(container)

yields

[[  2   2   9   6]
 [  8   5   7   8]
 [112  70 487 124]
 [859   8 275 936]
 [317 134 393 909]]

while requiring space for only one array of shape (5, 4), plus temporarily used space for the random arrays.
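
To check that the slices really are views onto container rather than copies, one can inspect the base attribute or call np.shares_memory. This is a small sketch reusing the names from the example above; the fill value 7 is arbitrary.

```python
import numpy as np

container = np.empty((5, 4), dtype=int)
container1 = container[:2]
container2 = container[2:]

# Basic slices are views: they borrow the parent's buffer.
print(container1.base is container)             # True
print(np.shares_memory(container2, container))  # True

# In-place assignment writes through to the parent array ...
container1[:] = 7
print(container[0, 0])                          # 7

# ... whereas plain assignment only rebinds the name to a new, separate array.
container1 = np.zeros((2, 4), dtype=int)
print(np.shares_memory(container1, container))  # False
print(container[0, 0])                          # still 7
```

The last two lines show why the difference between container1[:] = ... and container1 = ... matters: the second form leaves container untouched and allocates a fresh array.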

Thus, you wouldn't have to change very much in your code to save memory. Just set it up with

container = np.empty((5900, 4000))
container1 = container[:900]   
container2 = container[900:]   

and then use

container1[:] = ...

instead of

container1 = ...

to assign values in place. (Or, of course, you could just write directly into container.)
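
Writing directly into container might look like the sketch below, which fills the preallocated array one small chunk at a time so that only one chunk-sized temporary exists at any moment. The load_rows function is a hypothetical stand-in for wherever the data actually comes from, and the shape is scaled down from (5900, 4000) to keep the example quick.

```python
import numpy as np

def load_rows(start, stop, n_cols):
    """Hypothetical data source: returns rows start..stop-1 as a small array."""
    rows = np.arange(start, stop, dtype=np.float64)
    return np.repeat(rows[:, None], n_cols, axis=1)

n_rows, n_cols, chunk = 59, 4, 10    # scaled-down stand-in for (5900, 4000)
container = np.empty((n_rows, n_cols))  # allocated once, up front

for start in range(0, n_rows, chunk):
    stop = min(start + chunk, n_rows)
    # Slice assignment copies into the existing buffer; no new big array is made.
    container[start:stop] = load_rows(start, stop, n_cols)

print(container.shape)   # (59, 4)
print(container[-1])     # [58. 58. 58. 58.]
```

The chunk size is a trade-off: smaller chunks lower the temporary memory needed per step at the cost of more loop iterations.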
