How to merge very large numpy arrays?


Question

I will have many Numpy arrays stored in .npz files, which are saved using the savez_compressed function.

I am splitting the information across many arrays because otherwise the functions I am using crash due to memory issues. The data is not sparse.
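For context, a minimal sketch of how such chunks might be written with savez_compressed. The filenames, the key 'array' (which the answer's code below reads back), and the filler data are assumptions for illustration, not part of the question:

import numpy as np

# Hypothetical: split some filler data into pieces and save each piece
# to its own compressed .npz file under the key 'array'.
data = np.random.rand(1_000, 8)
for i, piece in enumerate(np.array_split(data, 4)):
    np.savez_compressed(f'file{i + 1}.npz', array=piece)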

I will need to join all that info into one single array (to be able to process it with some routines), and store it on disk (to process it many times with different parameters).

The arrays won't fit into RAM plus swap memory.

How can I merge them into one single array and save it to disk?

I suspect that I should use mmap_mode, but I do not know exactly how. Also, I imagine there could be performance issues if I do not reserve contiguous disk space first.

I have read this post, but I still cannot work out how to do it.
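For reference, np.load's mmap_mode memory-maps a .npy file instead of reading it into RAM, so only the slices you touch are pulled from disk. A minimal sketch, assuming the merged array has already been written to a hypothetical merged.npy:

import numpy as np

# Memory-map instead of loading: this returns a numpy.memmap, so no
# data is read from disk until the array is actually indexed.
merged = np.load('merged.npy', mmap_mode='r')

# Only these rows are read from disk.
part = merged[1000:2000]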

Edit

Clarification: I have written many functions to process similar data, and some of them require an array as an argument. In some cases I could pass them only part of this large array by slicing it. But it is still important to have all the information in such an array.

This is because of the following: the arrays contain time-ordered information (from physical simulations). Among the arguments of the functions, the user can set the initial and final time to process. He/she can also set the size of the processing chunk (which is important because it affects performance, but the allowed chunk size depends on the computational resources). Because of this, I cannot store the data as separate chunks.

The way in which this particular array (the one I am trying to create) is built does not matter, as long as it works.
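As a hedged illustration of that access pattern (the row bounds and chunk size below are placeholders, not values from the question): once the merged, time-ordered array is memory-mapped, a user-selected time window can be processed in chunks sized to the available resources:

import numpy as np

merged = np.load('merged.npy', mmap_mode='r')  # hypothetical merged file

# Placeholder values: row range of the requested time window and a
# chunk size tuned to the machine's memory.
t0_row, t1_row, chunk_rows = 10_000, 50_000, 4_096

for lo in range(t0_row, t1_row, chunk_rows):
    window = merged[lo:min(lo + chunk_rows, t1_row)]
    # ... hand `window` to the processing routine ...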

Answer

You should be able to load the data chunk by chunk into a np.memmap array:

import numpy as np

data_files = ['file1.npz', 'file2.npz', ...]

# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols = chunk.shape[1]
        dtype = chunk.dtype

# Once the size is known, create the memmap and write the chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)

However, as pointed out in the comments, working across a dimension that is not the fastest-varying one will be very slow.
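One hedged refinement: np.memmap writes a raw buffer with no header, so the shape and dtype of 'merged.buffer' must be tracked separately to reopen it later. An alternative, under the same assumptions as the code above (same data_files, rows, cols, and dtype), is np.lib.format.open_memmap, which produces a real .npy file that a later np.load(..., mmap_mode='r') can reopen without that bookkeeping:

import numpy as np

# Variant of the merge step above: open_memmap writes a proper .npy
# file, so the shape and dtype travel with the data on disk.
merged = np.lib.format.open_memmap('merged.npy', mode='w+',
                                   dtype=dtype, shape=(rows, cols))

idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)

merged.flush()  # make sure everything is on disk before reuse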
