How can I combine multiple .h5 files?

Question

Everything that is available online is too complicated. My database is large, so I exported it in parts. I now have three .h5 files and I would like to combine them into one .h5 file for further work. How can I do it?

Recommended Answer

These examples show how to use h5py to copy datasets between 2 HDF5 files. See my other answer for PyTables examples. I created some simple HDF5 files to mimic CSV type data (all floats, but the process is the same if you have mixed data types). Based on your description, each file only has one dataset. When you have multiple datasets, you can extend this process with visititems() in h5py.
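As a hedged sketch of that visititems() extension (the file name and group layout here are invented for illustration), the callback copies every dataset it visits and recreates parent groups in the destination:

```python
import h5py
import numpy as np

# Build a sample source file with a nested group (illustration only)
with h5py.File('nested_src.h5', 'w') as h5f:
    h5f.create_dataset('top', data=np.arange(10.0))
    h5f.create_dataset('grp/inner', data=np.arange(20.0))

# Copy every dataset, at any depth, into a new file, preserving paths
with h5py.File('nested_src.h5', 'r') as h5fr, \
     h5py.File('nested_copy.h5', 'w') as h5fw:

    def copy_dataset(name, obj):
        # visititems() calls this for every group and dataset
        if isinstance(obj, h5py.Dataset):
            parent = name.rsplit('/', 1)[0] if '/' in name else '/'
            h5fw.require_group(parent)       # ensure destination group exists
            h5fr.copy(name, h5fw[parent])    # copy the dataset under that group

    h5fr.visititems(copy_dataset)
```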

Note: code to create the HDF5 files used in the examples is at the end.

All methods use glob() to find the HDF5 files used in the operations below.

Method 1: Create External Links
This results in 3 Groups in the new HDF5 file, each with an external link to the original data. This does not copy the data, but provides access to the data in all files via the links in 1 file.

import glob
import h5py

with h5py.File('table_links.h5', mode='w') as h5fw:
    link_cnt = 0
    for h5name in glob.glob('file*.h5'):
        link_cnt += 1
        # each 'linkN' entry points at the root group of one source file
        h5fw['link' + str(link_cnt)] = h5py.ExternalLink(h5name, '/')
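Once the link file exists, the linked data reads as if it lived in table_links.h5 itself. A minimal sketch, assuming source files shaped like those created by the code at the end of this answer:

```python
import glob
import h5py
import numpy as np

# Stand-ins for the source files (10x5 float datasets, as created below)
for fcnt in range(1, 4):
    with h5py.File('file' + str(fcnt) + '.h5', 'w') as h5fw:
        h5fw.create_dataset('data_' + str(fcnt), data=np.random.random((10, 5)))

# Method 1's link file: one external link per source file
with h5py.File('table_links.h5', mode='w') as h5fw:
    link_cnt = 0
    for h5name in sorted(glob.glob('file*.h5')):
        link_cnt += 1
        h5fw['link' + str(link_cnt)] = h5py.ExternalLink(h5name, '/')

# Reading through a link: 'link1' resolves to the root group of file1.h5
with h5py.File('table_links.h5', 'r') as h5f:
    arr = h5f['link1/data_1'][:]
```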

Method 2a: Copy Data 'as-is'
(26-May-2020 update: This uses the .copy() method for all datasets.)
This copies the data from each dataset in the original file to the new file using the original dataset names. It loops to copy ALL root level datasets. This requires datasets in each file to have different names. The data is not merged into one dataset.

with h5py.File('table_copy.h5', mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        h5fr = h5py.File(h5name, 'r')
        for obj in h5fr.keys():
            # Group.copy() copies the dataset (data and attributes) as-is
            h5fr.copy(obj, h5fw)
        h5fr.close()

Method 2b: Copy Data 'as-is'
(This was my original answer, before I knew about the .copy() method.)
This copies the data from each dataset in the original file to the new file using the original dataset name. This requires datasets in each file to have different names. The data is not merged into one dataset.

with h5py.File('table_copy.h5', mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        h5fr = h5py.File(h5name, 'r')
        dset1 = list(h5fr.keys())[0]      # name of the (only) dataset
        arr_data = h5fr[dset1][:]         # read it into a NumPy array
        h5fw.create_dataset(dset1, data=arr_data)
        h5fr.close()

Method 3a: Merge all data into 1 Fixed size Dataset
This copies and merges the data from each dataset in the original files into a single dataset in the new file. In this example there are no restrictions on the dataset names. Also, I initially create a large dataset and don't resize it. This assumes there are enough rows to hold all of the merged data. Size checks should be added for production work.

with h5py.File('table_merge.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        h5fr = h5py.File(h5name, 'r')
        dset1 = list(h5fr.keys())[0]
        arr_data = h5fr[dset1][:]
        # creates 'alldata' on the first pass; later passes reuse it
        h5fw.require_dataset('alldata', dtype="f", shape=(50, 5), maxshape=(100, 5))
        h5fw['alldata'][row1:row1+arr_data.shape[0], :] = arr_data[:]
        row1 += arr_data.shape[0]
        h5fr.close()

Method 3b: Merge all data into 1 Resizeable Dataset
This is similar to the method above. However, I create a resizeable dataset and enlarge it based on the amount of data that is read and added.

with h5py.File('table_merge.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        h5fr = h5py.File(h5name, 'r')
        dset1 = list(h5fr.keys())[0]
        arr_data = h5fr[dset1][:]
        dslen = arr_data.shape[0]
        cols = arr_data.shape[1]
        if row1 == 0:
            # first file: create a resizeable dataset (unlimited rows)
            h5fw.create_dataset('alldata', dtype="f", shape=(dslen, cols), maxshape=(None, cols))
        if row1 + dslen <= len(h5fw['alldata']):
            h5fw['alldata'][row1:row1+dslen, :] = arr_data[:]
        else:
            # grow the dataset, then append the new rows
            h5fw['alldata'].resize((row1 + dslen, cols))
            h5fw['alldata'][row1:row1+dslen, :] = arr_data[:]
        row1 += dslen
        h5fr.close()

To create the source files read by the examples above:

import h5py
import numpy as np

for fcnt in range(1, 4, 1):
    fname = 'file' + str(fcnt) + '.h5'
    arr = np.random.random(50).reshape(10, 5)   # 10 rows x 5 columns of floats
    with h5py.File(fname, 'w') as h5fw:
        h5fw.create_dataset('data_' + str(fcnt), data=arr)
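Putting the pieces together, a quick end-to-end sanity check (the kind of check suggested under Method 3a): create the three source files, merge them with the resizeable approach from Method 3b, and confirm the merged row count:

```python
import glob
import h5py
import numpy as np

# Create three 10x5 source files, as in this answer
for fcnt in range(1, 4):
    with h5py.File('file' + str(fcnt) + '.h5', 'w') as h5fw:
        h5fw.create_dataset('data_' + str(fcnt),
                            data=np.random.random(50).reshape(10, 5))

# Merge into one resizeable dataset (condensed Method 3b)
with h5py.File('table_merge.h5', 'w') as h5fw:
    row1 = 0
    for h5name in sorted(glob.glob('file*.h5')):
        with h5py.File(h5name, 'r') as h5fr:
            arr_data = h5fr[list(h5fr.keys())[0]][:]
        if row1 == 0:
            # first file: create with unlimited rows
            h5fw.create_dataset('alldata', dtype='f',
                                shape=arr_data.shape,
                                maxshape=(None, arr_data.shape[1]))
        else:
            # later files: grow before appending
            h5fw['alldata'].resize((row1 + arr_data.shape[0],
                                    arr_data.shape[1]))
        h5fw['alldata'][row1:row1 + arr_data.shape[0], :] = arr_data
        row1 += arr_data.shape[0]

# The merged dataset should hold 3 files x 10 rows = 30 rows
with h5py.File('table_merge.h5', 'r') as h5f:
    merged_shape = h5f['alldata'].shape
```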
