Merge all h5 files using h5py


Problem Description

I am a novice at coding. Can someone help with a Python script, using h5py, that reads all directories and sub-directories and merges multiple h5 files into a single h5 file?

Recommended Answer

What you need is a list of all datasets in the file. The notion of a recursive function is what is needed here: it lets you extract all 'datasets' from a group, and when one of the entries turns out to be a group itself, recursively does the same thing until all datasets are found. For example:

/
|- dataset1
|- group1
   |- dataset2
   |- dataset3
|- dataset4
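For reference, a file with this layout could be built as follows (the file name 'example.hdf5' and the dataset contents are just illustrations):

```python
import numpy as np
import h5py

# create the example layout shown above
with h5py.File('example.hdf5', 'w') as f:
    f.create_dataset('dataset1', data=np.arange(3))
    group = f.create_group('group1')
    group.create_dataset('dataset2', data=np.zeros(2))
    group.create_dataset('dataset3', data=np.ones(2))
    f.create_dataset('dataset4', data=np.arange(4))
```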

Your function should, in pseudo-code, look like:

def getdatasets(key, file):

  out = []

  for name in file[key]:

    path = join(key, name)

    if file[path] is dataset: out += [path]
    else:                     out += getdatasets(path, file)

  return out

For our example:

  1. /dataset1 is a dataset: add its path to the output, giving

     out = ['/dataset1']

  2. /group1 is not a dataset: call getdatasets('/group1', file)

     1. /group1/dataset2 is a dataset: add its path, giving

        nested_out = ['/group1/dataset2']

     2. /group1/dataset3 is a dataset: add its path, giving

        nested_out = ['/group1/dataset2', '/group1/dataset3']

     This is added to what we already had:

     out = ['/dataset1', '/group1/dataset2', '/group1/dataset3']

  3. /dataset4 is a dataset: add its path, giving

     out = ['/dataset1', '/group1/dataset2', '/group1/dataset3', '/dataset4']

This list can be used to copy all data to another file.
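The walk-through above can be reproduced with plain nested dicts standing in for h5py groups. This mock is purely illustrative (a dict plays the role of a group, any other value the role of a dataset; the signature differs slightly from the h5py version below):

```python
def getdatasets(prefix, group):
    # 'group' is a dict: sub-dicts act as groups, other values as datasets
    out = []
    for name, item in group.items():
        path = prefix + name
        if isinstance(item, dict):            # a "group": recurse into it
            out += getdatasets(path + '/', item)
        else:                                 # a "dataset": record its path
            out.append(path)
    return out

# mock of the example tree above
tree = {'dataset1': 1,
        'group1': {'dataset2': 2, 'dataset3': 3},
        'dataset4': 4}

print(getdatasets('/', tree))
# → ['/dataset1', '/group1/dataset2', '/group1/dataset3', '/dataset4']
```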

To make a simple clone you could do the following:

    import h5py
    import numpy as np
    
    # function to return a list of paths to each dataset
    def getdatasets(key,archive):
    
      if key[-1] != '/': key += '/'
    
      out = []
    
      for name in archive[key]:
    
        path = key + name
    
        if isinstance(archive[path], h5py.Dataset):
          out += [path]
        else:
           out += getdatasets(path,archive)
    
      return out
    
    
    # open HDF5-files
    data     = h5py.File('old.hdf5','r')
    new_data = h5py.File('new.hdf5','w')
    
    # read as many datasets as possible from the old HDF5 file
    datasets = getdatasets('/',data)
    
    # get the group-names from the lists of datasets
    groups = list(set([i[::-1].split('/',1)[1][::-1] for i in datasets]))
    groups = [i for i in groups if len(i)>0]
    
    # sort groups based on depth
    idx    = np.argsort(np.array([len(i.split('/')) for i in groups]))
    groups = [groups[i] for i in idx]
    
    # create all groups that contain dataset that will be copied
    for group in groups:
      new_data.create_group(group)
    
    # copy datasets
    for path in datasets:
    
      # - get group name
      group = path[::-1].split('/',1)[1][::-1]
    
      # - minimum group name
      if len(group) == 0: group = '/'
    
      # - copy data
      data.copy(path, new_data[group])
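
The expression i[::-1].split('/',1)[1][::-1] in the code above strips the last component of a path by reversing the string, splitting once, and reversing back; str.rsplit expresses the same operation more directly:

```python
path = '/group1/dataset2'

# reversed-split idiom used in the code above
parent = path[::-1].split('/', 1)[1][::-1]
print(parent)                          # → /group1

# equivalent, more direct
assert parent == path.rsplit('/', 1)[0]
```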
    

Further customizations are of course possible, depending on what you want. You describe some combination of files. In that case you would have to open the output file in append mode:

     new_data = h5py.File('new.hdf5','a')

and probably add something to the path, so that datasets from different files end up under distinct names.
