Creating a dataset from multiple hdf5 groups


Question

I am creating a dataset from multiple hdf5 groups.

Code for the groups:

np.array(hdf.get('all my groups'))

I have then added code for creating a dataset from the groups.

with h5py.File('/train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)

The error message is

ValueError: operands could not be broadcast together with shapes (534456,4) (534456,14)

The numbers in each group are the same other than the varying column lengths. I want to combine 5 separate groups into one dataset.
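
The error itself is worth a note: one_T + two_T + ... adds the arrays element-wise, and NumPy cannot broadcast arrays whose column counts differ (here (534456,4) against (534456,14)). If the intent is to place the groups side by side as columns of one dataset, concatenating along axis 1 with np.hstack avoids the error. A minimal sketch, assuming the shapes reported in the error message (one_T, two_T, etc. stand in for the arrays read from the groups):

import numpy as np
import h5py

# Stand-in arrays; shapes taken from the error message, values are random
one_T = np.random.random((534456, 4))
two_T = np.random.random((534456, 14))
# three_T, four_T, five_T would be loaded the same way from their groups

# hstack concatenates along columns instead of adding element-wise
combined = np.hstack([one_T, two_T])   # shape (534456, 18)

with h5py.File('train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=combined)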

Solution

This answer addresses the OP's request in the comments to my first answer ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than in the first answer. As a result, I used a different approach to define the dataset names and columns to be copied. Differences:

  • The first solution iterates over the dataset names from keys() (copying each dataset completely, appending to a dataset in the new file). The size of the new dataset is calculated by summing the sizes of all datasets.
  • The second solution uses 2 lists to define 1) the dataset names (ds_list) and 2) the associated columns to copy from each dataset (col_list is a list of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns using col_list.
  • How you decide to do this depends on your data.
  • Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems; a sketch of such checks follows this list.
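
As an illustration of those missing tests, here is a minimal sketch (not part of the original answer; it assumes file1.h5 and the dataset names used in the code below) that verifies all source datasets share one dtype and one row count before copying:

import h5py

ds_list = ['ds_1', 'ds_2', 'ds_3', 'ds_4']   # dataset names, as in the code below

# Verify that all source datasets share one dtype and one row count
with h5py.File('file1.h5', 'r') as h5f1:
    dtypes = {h5f1[ds].dtype for ds in ds_list}
    row_counts = {h5f1[ds].shape[0] for ds in ds_list}
    if len(dtypes) != 1 or len(row_counts) != 1:
        raise ValueError(f'incompatible datasets: dtypes={dtypes}, rows={row_counts}')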

Code below:

import numpy as np
import h5py

# Data for file1
arr1 = np.random.random(120).reshape(20,6)
arr2 = np.random.random(120).reshape(20,6)
arr3 = np.random.random(120).reshape(20,6)
arr4 = np.random.random(120).reshape(20,6)

# Create file1 with 4 datasets
with h5py.File('file1.h5','w') as h5f :
    h5f.create_dataset('ds_1',data=arr1)
    h5f.create_dataset('ds_2',data=arr2)
    h5f.create_dataset('ds_3',data=arr3)
    h5f.create_dataset('ds_4',data=arr4)
 
# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
     h5py.File('file2.h5','w') as h5f2 :

# Loop over datasets in file1 to get dtype and rows (should test compatibility)        
    for i, ds in enumerate(h5f1.keys()) :
        if i == 0:
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            break

# Create new empty dataset with appropriate dtype and size
# Use maxshape parameter to make resizable in the future

    ds_list = ['ds_1','ds_2','ds_3','ds_4']
    col_list =[ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
    n_cols = sum( [ len(c) for c in col_list])
    h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))
    
# Loop over datasets in file1, read data into xfer_arr, and write to file2        
    first = 0  
    for ds, cols in zip(ds_list, col_list) :
        xfer_arr = h5f1[ds][:,cols]
        last = first + xfer_arr.shape[1]
        h5f2['combined'][:, first:last] = xfer_arr[:]
        first = last
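
To check the result (a quick sketch, assuming the code above has run): the combined dataset should have 20 rows and 6 + 2 + 2 + 6 = 16 columns.

import h5py

with h5py.File('file2.h5', 'r') as h5f2:
    combined = h5f2['combined'][:]
    print(combined.shape)   # expected: (20, 16)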
