Adding big matrices stored in HDF5 datasets

Question

I have two HDF5 files with an identical structure, each storing a matrix of the same shape. I need to create a third HDF5 file with a matrix representing the element-wise sum of the two matrices mentioned above. Given that the matrices are extremely large (in the GB-TB range), what would be the best way to do it, preferably in parallel? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing it?

Answer

Yes, this is possible. The key is to read slices of the data from file1 and file2, do the element-wise sum, then write that slice of new data to file3. You can do this with h5py or PyTables (aka tables). No other libraries are required. I only have passing knowledge of parallel computing, but I know h5py supports an MPI interface through the mpi4py Python package. Details here: h5py docs: Parallel HDF5
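
For the parallel route, here is a minimal sketch (not from the original answer) of how the slicing loop could be distributed over MPI ranks. It assumes h5py was built with parallel HDF5 support (the driver='mpio' mode from the docs linked above) and uses the same file and dataset names as the serial example below:

import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD

# All ranks open the files together with the MPI-IO driver
with h5py.File('file1.h5', 'r', driver='mpio', comm=comm) as h5f1, \
     h5py.File('file2.h5', 'r', driver='mpio', comm=comm) as h5f2, \
     h5py.File('file3.h5', 'w', driver='mpio', comm=comm) as h5f3:
    ds1 = h5f1['data_1']
    ds2 = h5f2['data_2']
    # Dataset creation is a collective operation: every rank must call it
    ds3 = h5f3.create_dataset('data_3', shape=ds1.shape, dtype=ds1.dtype)
    # Each rank sums a disjoint subset of slices along the last axis
    for i in range(comm.Get_rank(), ds1.shape[2], comm.Get_size()):
        ds3[:, :, i] = ds1[:, :, i] + ds2[:, :, i]

You would launch it with something like mpiexec -n 4 python your_script.py (the script name is arbitrary); each process then handles every 4th slice.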

Here is a simple example. It creates 2 files, each with a dataset of random floats, shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes the result to the same slice in file3. To test with large data, you can modify the shapes to match your files.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2 and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the slice iterator over the dataset.

import h5py
import numpy as np
import sys

# Create file1 and file2, each with a (10,10,10) dataset of random floats
arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file1.h5', 'w') as h5fw:
    h5fw.create_dataset('data_1', data=arr)

arr = np.random.random(10**3).reshape(10, 10, 10)
with h5py.File('file2.h5', 'w') as h5fw:
    h5fw.create_dataset('data_2', data=arr)

# Open both files for reading and compare the dataset shapes
h5fr1 = h5py.File('file1.h5', 'r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5', 'r')
f2shape = h5fr2['data_2'].shape

if f1shape != f2shape:
    print('Dataset shapes do not match')
    h5fr1.close()
    h5fr2.close()
    sys.exit('Exiting due to error.')

else:
    with h5py.File('file3.h5', 'w') as h5fw:
        # Note: dtype='f' is float32; use the source dtype instead if you
        # need to keep full float64 precision
        ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype='f')

        # Sum one slice at a time to keep memory use small
        for i in range(f1shape[2]):
            arr1_slice = h5fr1['data_1'][:,:,i]
            arr2_slice = h5fr2['data_2'][:,:,i]
            arr3_slice = arr1_slice + arr2_slice
            ds3[:,:,i] = arr3_slice

            # alternately, you can slice and sum in 1 line:
            # ds3[:,:,i] = h5fr1['data_1'][:,:,i] + \
            #              h5fr2['data_2'][:,:,i]

    print('Done.')

h5fr1.close()
h5fr2.close()
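
Since PyTables is mentioned above as an alternative, here is a rough sketch of the same loop written with PyTables instead of h5py; it assumes the same file and dataset names as the example, and that PyTables can read the plain HDF5 datasets as generic arrays:

import tables as tb

with tb.open_file('file1.h5', 'r') as h5f1, \
     tb.open_file('file2.h5', 'r') as h5f2, \
     tb.open_file('file3.h5', 'w') as h5f3:
    ds1 = h5f1.root.data_1
    ds2 = h5f2.root.data_2
    # Create a chunked output array matching the inputs' shape and dtype
    ds3 = h5f3.create_carray('/', 'data_3', shape=ds1.shape,
                             atom=tb.Atom.from_dtype(ds1.dtype))
    # Same slice-at-a-time loop as the h5py version
    for i in range(ds1.shape[2]):
        ds3[:, :, i] = ds1[:, :, i] + ds2[:, :, i]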
