Why is this numpy array too big to load?


Problem description

I have a 3.374 GB .npz file, myfile.npz.

I can read it in and view the filenames:

a = np.load('myfile.npz')
a.files

gives

['arr_1','arr_0']

I can read in 'arr_1' OK:

a1=a['arr_1']

However, I cannot load in arr_0, or read its shape:

a1=a['arr_0']
a['arr_0'].shape

Both of the above operations give the following error:

ValueError: array is too big

I have 16 GB of RAM, of which 8.370 GB is available, so the problem doesn't seem to be memory related. My questions are:

  1. Should I be able to read this file in?

  2. Can anyone explain this error?

  3. I have been looking at using np.memmap to get around this - is this a reasonable approach?

  4. What debugging approach should I use?

EDIT:

I got access to a computer with more RAM (48 GB) and the array loaded. The dtype was in fact complex128, and the uncompressed size of a['arr_0'] was 5750784000 bytes. It seems that some RAM overhead may be required. Either that, or my estimate of the available RAM was wrong (I used Windows Sysinternals RAMMap).

Solution

An np.complex128 array with dimensions (200, 1440, 3, 13, 32) ought to take up about 5.35GiB uncompressed, so if you really did have 8.3GB of free, addressable memory then in principle you ought to be able to load the array.
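
For reference, that figure can be reproduced directly from the shape and dtype, and it matches the 5750784000 bytes reported in the edit above:

import numpy as np

# Uncompressed size of a complex128 array with shape (200, 1440, 3, 13, 32).
shape = (200, 1440, 3, 13, 32)
itemsize = np.dtype(np.complex128).itemsize   # 16 bytes per element

n_elements = 1
for dim in shape:                              # plain Python ints avoid any
    n_elements *= dim                          # risk of 32-bit integer overflow

n_bytes = n_elements * itemsize
print(n_bytes)            # 5750784000 bytes
print(n_bytes / 2.0**30)  # ~5.36 GiB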

However, based on your responses in the comments below, you are using 32 bit versions of Python and numpy. In Windows, a 32 bit process can only address up to 2GB of memory (or 4GB if the binary was compiled with the IMAGE_FILE_LARGE_ADDRESS_AWARE flag; most 32 bit Python distributions are not). Consequently, your Python process is limited to 2GB of address space regardless of how much physical memory you have.
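
If you are unsure which build you are running, a quick check using only the standard library is:

import struct
import sys

# Pointer size in bytes: 4 on a 32 bit interpreter, 8 on a 64 bit one.
print(struct.calcsize("P") * 8, "bit Python")

# Equivalent check: sys.maxsize is 2**31 - 1 on 32 bit builds.
print("64 bit" if sys.maxsize > 2**32 else "32 bit")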

You can either install 64 bit versions of Python, numpy, and any other Python libraries you need, or live with the 2GB limit and try to work around it. In the latter case you might get away with storing arrays that exceed the 2GB limit mainly on disk (e.g. using np.memmap), but I'd advise you to go for option #1, since operations on memmapped arrays are in most cases a lot slower than on normal np.arrays that reside wholly in RAM.
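
For the work-around route, here is a minimal sketch of a plain np.memmap-backed array. The file name arr_0.dat and the small placeholder shape are illustrative only; substitute the real shape and dtype:

import numpy as np

# Placeholder shape/dtype for illustration; substitute the real ones.
shape = (1000, 1000)
dtype = np.complex128

# Create a disk-backed array once ('w+' creates or overwrites the file).
mm = np.memmap('arr_0.dat', dtype=dtype, mode='w+', shape=shape)
mm[0, :10] = 1 + 2j      # writes go to the disk-backed buffer
mm.flush()
del mm

# Reopen it later read-only; only the slices you index are read from disk.
mm = np.memmap('arr_0.dat', dtype=dtype, mode='r', shape=shape)
print(mm[0, :10])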


If you already have another machine that has enough RAM to load the whole array into core memory then I would suggest you save the array in a different format (either as a plain np.memmap binary, or perhaps better, in an HDF5 file using PyTables or H5py). It's also possible (although slightly trickier) to extract the problem array from the .npz file without loading it into RAM, so that you can then open it as an np.memmap array residing on disk:

import numpy as np

# some random sparse (compressible) data
x = np.random.RandomState(0).binomial(1, 0.25, (1000, 1000))

# save it as a compressed .npz file
np.savez_compressed('x_compressed.npz', x=x)

# now load it as a numpy.lib.npyio.NpzFile object
obj = np.load('x_compressed.npz')

# contains a list of the stored arrays in the format '<name>.npy'
namelist = obj.zip.namelist()

# extract 'x.npy' into the current directory
obj.zip.extract(namelist[0])

# now we can open the array as a memmap
x_memmap = np.load(namelist[0], mmap_mode='r+')

# check that x and x_memmap are identical
assert np.all(x == x_memmap[:])
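
For the HDF5 option mentioned above, a rough sketch using h5py might look like the following; the file name arr_0.h5 and the dataset name arr_0 are placeholders. The save step would run on the machine with enough RAM, and the read step can then slice the dataset directly from disk without ever holding the full array in memory:

import h5py
import numpy as np

# On the machine with enough RAM: write the big array into an HDF5 file.
# Here a zeros array stands in for a['arr_0'].
data = np.zeros((200, 1440, 3, 13, 32), dtype=np.complex128)
with h5py.File('arr_0.h5', 'w') as f:
    f.create_dataset('arr_0', data=data, compression='gzip')

# On the memory-limited machine: open the file and read only what you need.
with h5py.File('arr_0.h5', 'r') as f:
    dset = f['arr_0']               # no data is read at this point
    print(dset.shape, dset.dtype)   # metadata only
    chunk = dset[0, :, 0, 0, :]     # just this slice is read from disk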
