h5py writing: How to efficiently write millions of .npy arrays to a .hdf5 file?

Views: 41. Published: 2022/3/1 18:08:56. Tags: python, numpy, bigdata, hdf5, h5py


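Question

I have to store sub-samples of large images as .npy arrays of size (20, 20, 5). To sample uniformly when training a classification model, I am looking for an efficient way to store nearly 10 million sub-samples so that this is possible.

If I store them as whole images, sampling during training would not be representative of the distribution. I have the storage space, but I would run out of inodes trying to store that many small files. h5py, writing to an HDF5 file, is the natural answer to my problem, but the process is very slow. Running a program for a day and a half was not enough to write all of the sub-samples. I am new to h5py, and I wonder whether too many individual writes are the cause.

If so, I am not sure how to chunk the data properly while avoiding the problem of non-uniform sampling. Each image has a different number of sub-samples (for example, one image may be (20000, 20, 20, 5) and another may be (32123, 20, 20, 5)).

This is the code I use to write each sample to the .hdf5 file: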

import random
import h5py
import numpy as np

# define possible groups
groups = ['training_samples', 'validation_samples', 'test_samples']

f = h5py.File('~/.../TrainingData_.hdf5', 'a', libver='latest')

At this point I run a sub-sampling function that returns a NumPy array, trarray, of size (x, 20, 20, 5).

Then:

label = np.array([1])
for i in range(trarray.shape[0]):
   group_choice = random.choices(groups, weights = [65, 15, 20])
   subarr = trarray[i,:,:,:]

   if group_choice[0] == 'training_samples':
       training_samples.create_dataset('ID-{}'.format(indx), data=subarr)
       training_labels.create_dataset('ID-{}'.format(indx), data=label)
       indx += 1
   elif group_choice[0] =='validation_samples':
       validation_samples.create_dataset('ID-{}'.format(indx), data=subarr)
       validation_labels.create_dataset('ID-{}'.format(indx), data=label)
       indx += 1
   else:
       test_samples.create_dataset('ID-{}'.format(indx), data=subarr)
       test_labels.create_dataset('ID-{}'.format(indx), data=label)
       indx += 1

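What can I do to improve this? Is there something fundamentally wrong with how I am using h5py?

Recommended answer

Update 03-22-2021: see the update about attributes noted below.
This is an interesting use case. My answer to a previous question touches on this issue (it is referenced in my first answer to this question). Clearly, the overhead of writing a large number of small objects is greater than the cost of the actual write. I was curious, so I created a prototype to explore different processes for writing the data.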

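My starting scenario:

1. I created a NumPy array of random integers with shape (NN, 20, 20, 5).
2. I then followed your logic, slicing one row at a time and assigning it as a training, validation, or test sample.
3. I wrote each slice as a new dataset in the appropriate group.
4. I added attributes to the group to reference the slice number of each dataset.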

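Key findings:

1. The time to write each array slice to a new dataset remains relatively constant throughout the process.
2. However, write time grows exponentially as the number of attributes (NN) increases. I did not understand this in my initial post. For small values of NN (<2,000), adding attributes is relatively fast.

Table of incremental write time for each 1,000 slices (without and with attributes). (Multiply by NN/1000 for the total time.)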

Slice count	Time (sec), without attributes	Time (sec), with attributes
1_000	0.34	2.4
2_000	0.34	12.7
5_000	0.33	111.7
10_000	0.34	1783.3
20_000	0.35	n/a
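Clearly, using attributes is not an efficient way to save the slice indices. Instead, I captured the index as part of the dataset name. This is shown in the "original" code below. The code to add attributes is included in case that is of interest.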
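As a minimal sketch of the naming approach described above, the snippet below stores each slice number in the dataset name and parses it back when reading. The file name, the 6-slice array, and the in-memory `core` driver are illustrative assumptions and were not part of the timing tests.

```python
import h5py
import numpy as np

# Small stand-in for the real data: 6 slices of shape (20, 20, 5).
trarray = np.random.randint(1, 255, (6, 20, 20, 5))

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File('name_index_demo.hdf5', 'w', driver='core', backing_store=False) as h5f:
    grp = h5f.create_group('training_samples')
    for i in range(trarray.shape[0]):
        # The slice number lives in the dataset name, not in an attribute.
        grp.create_dataset(f'ID-{i:04}', data=trarray[i])

    # Iterating a group yields member names; parse the number after 'ID-'.
    recovered = {name: int(name.split('-')[1]) for name in grp}

print(recovered)
```

This still creates one dataset per slice, so it only avoids the attribute cost, not the per-object overhead.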

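I created a new process that does all of the slicing first and then writes all of the data in 3 steps (1 each for the training, validation, and test samples). Since you can no longer get the slice index from the dataset name, I tested 2 different ways to save that data: 1) as a second "index" dataset for each "sample" dataset, and 2) as group attributes. Both methods are significantly faster than the original process. Writing the indices as an index dataset has almost no impact on performance. Writing them as attributes is much slower. Data:

Table of total write time for all slices (without and with attributes).

Slice count	Time (sec), no index	Time (sec), index dataset	Time (sec), with attributes
10_000	0.43	0.57	141.05
20_000	1.17	1.27	n/a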
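The slicing step that precedes the bulk write could also be done without a Python loop. The following is a sketch under assumptions, not the code that produced the timings above: it draws one group label per slice with NumPy and uses the matching positions as the index array. The seed, the 1,000-slice array, and the one-column index layout are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Stand-in for one sub-sampled image: 1,000 slices of shape (20, 20, 5).
trarray = rng.integers(1, 255, size=(1_000, 20, 20, 5), dtype=np.int32)

groups = ['training', 'validation', 'test']
# One group label (0, 1 or 2) per slice, drawn with the 65/15/20 weights.
choice = rng.choice(len(groups), size=trarray.shape[0], p=[0.65, 0.15, 0.20])

split = {}
for g, name in enumerate(groups):
    idx = np.flatnonzero(choice == g)   # original slice numbers in this group
    split[name] = (trarray[idx], idx)   # (samples, indices), ready for one bulk write each

for name, (samples, idx) in split.items():
    print(name, samples.shape, idx.shape)
```

Each `(samples, idx)` pair can then be passed to a single `create_dataset` call per group, in the same way as the sample and index datasets described above.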

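This slice-first, bulk-write process looks like a promising way to slice your data and write it to HDF5 in a reasonable time. You will have to work out the indexing notation for your own data.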
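When the images arrive one at a time and differ in size, the same bulk-write idea might be applied by appending to a resizable dataset. The sketch below is an assumption about how that could look, not something I timed: the helper `append_rows`, the chunk shapes, the file name, and the (image number, slice number) index layout are all hypothetical.

```python
import h5py
import numpy as np

def append_rows(dset, rows):
    """Grow a resizable dataset along axis 0 and write rows at the end."""
    start = dset.shape[0]
    dset.resize(start + rows.shape[0], axis=0)
    dset[start:] = rows

rng = np.random.default_rng(0)

# In-memory file for the demo; a real run would use a file on disk.
with h5py.File('append_demo.hdf5', 'w', driver='core', backing_store=False) as h5f:
    samples = h5f.create_dataset(
        'training_samples', shape=(0, 20, 20, 5), maxshape=(None, 20, 20, 5),
        dtype=np.int32, chunks=(64, 20, 20, 5))
    # Hypothetical index layout: (image number, slice number within that image).
    indices = h5f.create_dataset(
        'training_indices', shape=(0, 2), maxshape=(None, 2),
        dtype=np.int32, chunks=(4096, 2))

    for image_no, n_slices in enumerate([300, 457]):  # images differ in size
        trarray = rng.integers(1, 255, size=(n_slices, 20, 20, 5), dtype=np.int32)
        idx = np.column_stack([np.full(n_slices, image_no), np.arange(n_slices)])
        append_rows(samples, trarray)   # one write per image, not per slice
        append_rows(indices, idx)

    final_shape = samples.shape
    last_index = indices[indices.shape[0] - 1].tolist()

print(final_shape, last_index)
```

The file then holds one large dataset per split instead of millions of small ones, and the number of write calls scales with the number of images.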

Code for the starting scenario:

import random
import timeit

import h5py
import numpy as np

# define possible groups
groups = ['training', 'validation', 'test']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))
label = np.array([1])

with h5py.File('TrainingData_orig.hdf5', 'w') as h5f:
    # At this point I run a sub-sampling function that returns a NumPy array,
    # trarray, of size (x,20,20,5).
    for group in groups:
        h5f.create_group(group+'_samples')   
        h5f.create_group(group+'_labels')  
    
    time0 = timeit.default_timer()
    for i in range(trarray.shape[0]):
        group_choice = random.choices(groups, weights = [65, 15, 20])    

        h5f[group_choice[0]+'_samples'].create_dataset(f'ID-{i:04}', data=trarray[i,:,:,:])
        #h5f[group_choice[0]+'_labels'].create_dataset(f'ID-{i:04}', data=label)
        #h5f[group_choice[0]+'_samples'].attrs[f'ID-{i:04}'] = label

        if (i+1) % 1000 == 0:
            exe_time = timeit.default_timer() - time0          
            print(f'incremental time to write {i+1} datasets = {exe_time:.2f} secs')           
            time0 = timeit.default_timer()

Code for the test scenario:
Note: the calls that write the attributes are commented out.

import random
import timeit

import h5py
import numpy as np

# define possible groups
groups = ['training_samples', 'validation_samples', 'test_samples']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1,255, (20_000,20,20,5) )
training   = np.empty(trarray.shape,dtype=np.int32)
validation = np.empty(trarray.shape,dtype=np.int32)
test       = np.empty(trarray.shape,dtype=np.int32)

indx1, indx2, indx3 = 0, 0, 0
training_list = []
validation_list = []
test_list = []

training_idx = np.empty( (trarray.shape[0],2) ,dtype=np.int32)
validation_idx = np.empty( (trarray.shape[0],2) ,dtype=np.int32)
test_idx = np.empty( (trarray.shape[0],2) ,dtype=np.int32)

start = timeit.default_timer()

#At this point I run a sub-sampling function that returns a NumPy array, 
#trarray, of size (x,20,20,5).
for i in range(trarray.shape[0]):
    group_choice = random.choices(groups, weights=[65, 15, 20])
    if group_choice[0] == 'training_samples':
        # copy slice i into the next free row and record (row, original slice)
        training[indx1, :, :, :] = trarray[i, :, :, :]
        training_list.append((f'ID-{indx1:04}', i))
        training_idx[indx1, :] = [indx1, i]
        indx1 += 1
    elif group_choice[0] == 'validation_samples':
        validation[indx2, :, :, :] = trarray[i, :, :, :]
        validation_list.append((f'ID-{indx2:04}', i))
        validation_idx[indx2, :] = [indx2, i]
        indx2 += 1
    else:
        test[indx3, :, :, :] = trarray[i, :, :, :]
        test_list.append((f'ID-{indx3:04}', i))
        test_idx[indx3, :] = [indx3, i]
        indx3 += 1


with h5py.File('TrainingData1_.hdf5', 'w') as h5f :
    
    h5f.create_group('training')
    h5f['training'].create_dataset('training_samples', data=training[0:indx1,:,:,:])
    h5f['training'].create_dataset('training_indices', data=training_idx[0:indx1,:])
    # for label, idx in training_list:
    #     h5f['training']['training_samples'].attrs[label] = idx

    h5f.create_group('validation')
    h5f['validation'].create_dataset('validation_samples', data=validation[0:indx2,:,:,:])
    h5f['validation'].create_dataset('validation_indices', data=validation_idx[0:indx2,:])
    # for label, idx in validation_list:
    #     h5f['validation']['validation_samples'].attrs[label] = idx

    h5f.create_group('test')
    h5f['test'].create_dataset('test_samples', data=test[0:indx3,:,:,:])
    h5f['test'].create_dataset('test_indices', data=test_idx[0:indx3,:])
    # for label, idx in test_list:
    #     h5f['test']['test_samples'].attrs[label] = idx

exe_time = timeit.default_timer() - start          
print(f'Write time for {trarray.shape[0]} images slices = {exe_time:.2f} secs')
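With the samples stored as one large dataset per split, a uniformly random batch can be read back by drawing row numbers and indexing the dataset with them. This is a sketch assuming the same group and dataset names as the test scenario. The small in-memory file, the batch size of 32, and the made-up "original slice number" column are illustrative only.

```python
import h5py
import numpy as np

rng = np.random.default_rng(0)

with h5py.File('sample_demo.hdf5', 'w', driver='core', backing_store=False) as h5f:
    # Build a small file with the same layout as the test scenario.
    n = 500
    grp = h5f.create_group('training')
    grp.create_dataset('training_samples',
                       data=rng.integers(1, 255, size=(n, 20, 20, 5), dtype=np.int32))
    # Columns: (row in training_samples, original slice number; made up here).
    grp.create_dataset('training_indices',
                       data=np.column_stack([np.arange(n), np.arange(n) * 2]))

    samples = h5f['training/training_samples']
    # Draw 32 distinct rows uniformly. h5py requires index lists in increasing order.
    rows = np.sort(rng.choice(samples.shape[0], size=32, replace=False))
    batch = samples[rows]                             # shape (32, 20, 20, 5)
    origin = h5f['training/training_indices'][rows]   # matching index rows

print(batch.shape, origin.shape)
```

Because the rows are sorted for the read, the batch can be shuffled in memory afterwards if the order within a batch matters.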
