Shuffle HDF5 dataset using h5py

Question

I have a large HDF5 file (~30 GB) and I need to shuffle the entries (along the 0 axis) in each dataset. Looking through the h5py docs, I wasn't able to find either randomAccess or shuffle functionality, but I'm hoping that I've missed something.

Is anyone familiar enough with HDF5 to think of a fast way to randomly shuffle the data?

Here is a rough sketch of what I would implement with my limited knowledge:

import random

CHUNK = 100  # number of rows swapped per pass; 'datasets' is
             # the collection of h5py datasets to shuffle

for dataset in datasets:
    remaining = dataset.shape[0]  # rows not yet shuffled
    while remaining > 0:
        if remaining <= CHUNK:
            # Too few rows left for two disjoint chunks: swap the
            # two halves of what remains and stop.
            half = remaining // 2
            dataset[:half], dataset[half:2 * half] = \
                dataset[half:2 * half], dataset[:half]
            break
        # Pick two random chunk starts among the rows still in play
        # (like the original sketch, overlap is not guarded against).
        i = random.randrange(remaining - CHUNK)
        j = random.randrange(remaining - CHUNK)
        remaining -= 2 * CHUNK
        # Both right-hand-side slices are read into memory before
        # either write happens, so the tuple swap is safe with h5py.
        dataset[i:i + CHUNK], dataset[j:j + CHUNK] = \
            dataset[j:j + CHUNK], dataset[i:i + CHUNK]

Answer

You can use random.shuffle(dataset); it works because an h5py Dataset supports len() and the integer get/set indexing that Python's shuffle needs. This takes a little more than 11 minutes for a 30 GB dataset on my laptop with a Core i5 processor, 8 GB of RAM, and a 256 GB SSD. See the following:

>>> import os
>>> import random
>>> import time
>>> import h5py
>>> import numpy as np
>>>
>>> h5f = h5py.File('example.h5', 'w')
>>> h5f.create_dataset('example', (40000, 256, 256, 3), dtype='float32')
>>> # set all values of each instance equal to its index
... for i, instance in enumerate(h5f['example']):
...     h5f['example'][i, ...] = \
...             np.ones(instance.shape, dtype='float32') * i
...
>>> # get file size in bytes
... file_size = os.path.getsize('example.h5')
>>> print('Size of example.h5: {:.3f} GB'.format(file_size/2.0**30))
Size of example.h5: 29.297 GB
>>> def shuffle_time():
...     t1 = time.time()
...     random.shuffle(h5f['example'])
...     t2 = time.time()
...     print('Time to shuffle: {:.3f} seconds'.format(t2 - t1))
...
>>> print('Value of first 5 instances:\n{}'
...       ''.format(h5f['example'][:5, 0, 0, 0]))
Value of first 5 instances:
[ 0.  1.  2.  3.  4.]
>>> shuffle_time()
Time to shuffle: 673.848 seconds
>>> print('Value of first 5 instances after '
...       'shuffling:\n{}'.format(h5f['example'][:5, 0, 0, 0]))
Value of first 5 instances after shuffling:
[ 15733.  28530.   4234.  14869.  10267.]
>>> h5f.close()

Performance when shuffling several smaller datasets should not be worse than this.
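
If the file holds several datasets, a small loop over the file's members is enough. Below is a minimal sketch along the same lines; the file name several_datasets.h5 is a hypothetical placeholder, and it assumes every top-level member is a dataset:

import random
import h5py

# Open read/write and shuffle every top-level dataset in place
# (hypothetical file name; assumes all members are datasets,
# since iterating a File yields the names of its members).
with h5py.File('several_datasets.h5', 'r+') as f:
    for name in f:
        random.shuffle(f[name])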
