Optimising HDF5 dataset for Read/Write speed

Problem Description

I'm currently running an experiment where I scan a target spatially and grab an oscilloscope trace at each discrete pixel. Generally my trace lengths are 200 Kpts. After scanning the entire target I assemble these time-domain signals spatially and essentially play back a movie of what was scanned. My scan area is 330x220 pixels in size, so the entire dataset is larger than the RAM on the computer I have to use.

To start with I was just saving each oscilloscope trace as a numpy array, then after my scan completed I was downsampling/filtering etc. and piecing the movie together in a way that didn't run into memory problems. However, I'm now at a point where I can't downsample, as aliasing will occur, and thus need to access the raw data.

I've started looking into storing my large 3D data block in an HDF5 dataset using h5py. My main issue is with my chunk-size allocation. My incoming data is orthogonal to the plane that I'd like to read it out in. My main options (to my knowledge) for writing my data are:

    import numpy as np
    import h5py

    # scan size and trace length taken from the description above
    height, width, dataLen = 220, 330, 200000

    # Option 1: fast write, slow read (chunked along the trace axis)
    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen),
                                chunks=(1, 1, dataLen), dtype='f')
        for i in range(height):
            for j in range(width):
                dset[i, j, :] = np.random.random(200000)

    # Option 2: slow write, fast read (chunked along the image plane)
    # run as an alternative to Option 1, not after it -- the dataset name is reused
    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen),
                                chunks=(height, width, 1), dtype='f')
        for i in range(height):
            for j in range(width):
                dset[i, j, :] = np.random.random(200000)

Is there some way I can optimize the two cases so that neither is horribly inefficient to run?

Recommended Answer

You have some performance pitfalls in your code.

  1. You are using some sort of fancy indexing in the line dset[i,j,:] = np.random.random(200000) (don't change the number of array dims when reading from or writing to an HDF5 dataset).
  2. Set up a proper chunk-cache size if you are not reading or writing whole chunks (see the sketch after this list): https://stackoverflow.com/a/42966070/4045774
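
For reference, here is a minimal sketch of point 2 without the h5py_cache helper, assuming a newer h5py (2.9 or later) that accepts the rdcc_* chunk-cache parameters directly on h5py.File; the slot count and eviction weight below are illustrative values, not taken from the original answer:

    import h5py

    # assumption: h5py >= 2.9, which exposes the HDF5 chunk-cache parameters;
    # older versions need h5py_cache or the low-level API instead
    with h5py.File("test_h5py.hdf5", "a",
                   rdcc_nbytes=500 * 1024**2,  # raw chunk-cache size in bytes (~500 MB)
                   rdcc_nslots=100003,         # hash-table slots: a prime well above the number of cached chunks
                   rdcc_w0=0.75) as f:         # preferentially evict fully read/written chunks
        dset = f.require_dataset("uncompchunk", (220, 330, 200000),
                                 chunks=(20, 20, 20), dtype='f')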

Reduce the number of read and write calls to the HDF5 API.

The following example uses caching through the HDF5 API. To set up a proper cache size I will use h5py_cache: https://pypi.python.org/pypi/h5py-cache/1.0.1

You could further improve the performance if you do the caching yourself (read and write whole chunks); a rough sketch of that idea follows after the read example below.

Writing

    import numpy as np
    import h5py_cache

    h5pyfile = "test_h5py.hdf5"   # file must already exist for mode 'r+'
    height, width, dataLen = 220, 330, 200000

    # minimal cache size for reasonable performance would be 20*20*dataLen*4 = 320 MB, let's take a bit more
    with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=500*1024**2) as f:
        dset = f.create_dataset("uncompchunk", (height, width, dataLen), chunks=(20, 20, 20), dtype='f')
        for i in range(height):
            for j in range(width):
                # avoid fancy slicing: keep the number of dims the same on both sides of the assignment
                dset[i:i+1, j:j+1, :] = np.expand_dims(np.expand_dims(np.random.random(200000), axis=0), axis=0)

Reading

    # minimal cache size for reasonable performance would be height*width*500*4 = 145 MB, let's take a bit more
    with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=200*1024**2) as f:
        dset = f["uncompchunk"]
        for i in range(0, dataLen):
            # read a whole (height, width) slab per time step, then drop the singleton axis
            Image = np.squeeze(dset[:, :, i:i+1])
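
Finally, hedged as a rough illustration of the "do the caching yourself" idea above: buffer a chunk-aligned block of traces in RAM and hand it to HDF5 in a single call, so every write touches whole chunks. The 20x20 block size matches the chunk layout above; acquire_trace() is a hypothetical placeholder for the real oscilloscope readout and is not part of the original answer.

    import numpy as np
    import h5py

    height, width, dataLen = 220, 330, 200000   # scan size and trace length from the question
    block = 20                                  # matches the (20, 20, 20) chunk layout used above

    def acquire_trace(i, j):
        # hypothetical placeholder for reading the oscilloscope trace at pixel (i, j)
        return np.random.random(dataLen).astype('f')

    with h5py.File("test_h5py.hdf5", "a") as f:
        dset = f.require_dataset("uncompchunk", (height, width, dataLen),
                                 chunks=(block, block, block), dtype='f')
        for i0 in range(0, height, block):
            for j0 in range(0, width, block):
                ni = min(block, height - i0)
                nj = min(block, width - j0)
                # buffer one block of traces in RAM (roughly 20*20*200000*4 bytes = 320 MB)
                buf = np.empty((ni, nj, dataLen), dtype='f')
                for i in range(ni):
                    for j in range(nj):
                        buf[i, j, :] = acquire_trace(i0 + i, j0 + j)
                # one HDF5 write per block instead of one per pixel
                dset[i0:i0 + ni, j0:j0 + nj, :] = buf

This keeps each HDF5 call aligned with complete chunks, so the chunk cache is no longer the limiting factor on the write side.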
