Optimal HDF5 dataset chunk shape for reading rows


Problem description

I have a reasonably sized (18 GB compressed) HDF5 dataset and am looking to optimize reading rows for speed. The shape is (639038, 10000). I will be reading a selection of rows (say ~1000 rows) many times, located across the dataset, so I can't use x:(x+1000) to slice rows.

Reading rows from out-of-memory HDF5 with h5py is already slow, since I have to pass a sorted list and resort to fancy indexing. Is there a way to avoid fancy indexing, or a better chunk shape/size I can use?
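For reference, a minimal sketch of this access pattern (file name, dataset key and index generation are illustrative):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:           # illustrative file name
    dset = f['10000']                           # shape (639038, 10000)
    # h5py fancy indexing needs the row indices in increasing order,
    # and every chunk touched by a selected row is read in full.
    rows = np.sort(np.random.choice(dset.shape[0], size=1000, replace=False))
    summed = dset[rows, :].sum(axis=0)          # length-10000 result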

I have read rules of thumb such as 1-10 MB chunk sizes and choosing a shape consistent with what I'm reading. However, building a large number of HDF5 files with different chunk shapes for testing is computationally expensive and very slow.

For each selection of ~1,000 rows, I immediately sum them to get an array of length 10,000. My current dataset looks like this:

'10000': {'chunks': (64, 1000),
          'compression': 'lzf',
          'compression_opts': None,
          'dtype': dtype('float32'),
          'fillvalue': 0.0,
          'maxshape': (None, 10000),
          'shape': (639038, 10000),
          'shuffle': False,
          'size': 2095412704}

Things I have already tried:


  • Rewriting the dataset with a chunk shape of (128, 10000) (by my calculation ~5 MB) is too slow.

  • I looked at dask.array to optimise, but since ~1,000 rows fit comfortably in memory I don't see any benefit.

Recommended answer

Finding the right chunk-cache size

First I want to discuss some general things. It is very important to know that each individual chunk can only be read or written as a whole. The standard chunk cache of h5py, which can avoid excessive disk I/O, is only 1 MB by default and should in many cases be increased, as discussed later on.
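For reference, increasing the cache when opening a file with h5py (2.9 or newer) looks roughly like this; the file name and cache sizes here are only illustrative:

import h5py

# rdcc_nbytes: total chunk-cache size in bytes (default 1 MB)
# rdcc_nslots: number of hash-table slots; should be much larger than the
#              number of chunks that fit in the cache (a prime number helps)
# rdcc_w0:     eviction preference; 1.0 evicts fully read/written chunks first
f = h5py.File('data.h5', 'r',
              rdcc_nbytes=1024**3,
              rdcc_nslots=1_000_003,
              rdcc_w0=1.0)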

An example:


  • We have a dataset with shape (639038, 10000), float32 (25.5 GB uncompressed).

  • We want to write our data column-wise with dset[:,i] = arr and read it row-wise with arr = dset[i,:].

  • We choose a completely wrong chunk shape for this type of work, namely (1, 10000).

In this case the reading speed won't be too bad (although the chunk size is a little small), because we only read the data we are actually using. But what happens when we write to that dataset? If we access a column, one floating-point number of each chunk is written. This means we are actually writing the whole dataset (25.5 GB) with every iteration and reading the whole dataset every other time. This is because if you modify a chunk, you have to read it first if it is not cached (I assume a chunk-cache size below 25.5 GB here).
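To make this write amplification concrete, a quick arithmetic sketch of the worst case above (plain Python, numbers from the example):

itemsize = 4                       # float32
rows, cols = 639038, 10000
chunk = (1, 10000)                 # one chunk per row

# Writing one column touches one float in every chunk, and each modified
# chunk must be read (if uncached) and rewritten as a whole:
bytes_rewritten = rows * chunk[0] * chunk[1] * itemsize
print(bytes_rewritten / 1e9, "GB rewritten per column")   # ~25.6 GB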

So what can we improve here? In such a case we have to make a compromise between read/write speed and the memory used by the chunk cache.

An assumption that gives decent read and write speed:


  • We choose a chunk shape of (100, 1000).

  • If we want to iterate over the first dimension, we need at least 1000 * 639038 * 4 B (-> 2.55 GB) of cache to avoid the additional I/O overhead described above; a single chunk is only 100 * 1000 * 4 B (-> 0.4 MB). See the sketch after this list for the arithmetic.

  • So in this example we should provide at least 2.6 GB of chunk-data cache.
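A quick back-of-the-envelope check of these numbers (plain NumPy arithmetic, nothing h5py-specific):

import numpy as np

rows, cols = 639038, 10000
chunk = (100, 1000)
itemsize = np.dtype(np.float32).itemsize       # 4 bytes

one_chunk = chunk[0] * chunk[1] * itemsize     # 0.4 MB per chunk
# Accessing the data column-wise keeps one full column of chunks in play:
# all 639038 rows, but only the chunk[1] = 1000 columns of the current chunk column.
cache_needed = rows * chunk[1] * itemsize      # ~2.56 GB

print(f"one chunk:    {one_chunk / 1e6:.1f} MB")
print(f"cache needed: {cache_needed / 1e9:.2f} GB")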

Conclusion

There is no generally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without giving some thought to the chunk cache. RAM is orders of magnitude faster than the fastest SSD when it comes to random reads and writes.

Regarding your problem

I would simply read the random rows; the improper chunk-cache size is your real problem.

Compare the performance of the following code with your version:

import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # We are using 4 GB of chunk cache here ("rdcc_nbytes")
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=10_000_000)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(0, shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time()-t1)

    # Reading random rows
    # If we read one row, a whole chunk containing 100 rows is actually read,
    # but if we access a row which is already in the cache we see a huge speed-up.
    f = h5.File(File_Name_HDF5, 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=10_000_000)
    d = f["Test"]
    for j in range(0, 639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        inds = np.random.randint(0, high=shape[0]-1, size=1000)
        for i in range(0, inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time()-t1)
    f.close()

if __name__ == "__main__":
    ReadingAndWriting()

The simplest form of fancy slicing

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:

# Continues the script above (h5, time and np are already imported).
# Note: "h5c" is assumed to refer to the optional h5py_cache helper package,
# which exposes the cache size via chunk_cache_mem_size; with h5py >= 2.9 the
# rdcc_* keywords used in Writing_2 achieve the same thing.
import h5py_cache as h5c

def Writing():
    File_Name_HDF5 = 'Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # Writing_1 normal indexing
    ###########################################
    f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2*4000)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time()-t1)

    # Writing_2 simplest form of fancy indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=10_000_000)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array

    f.close()
    print(time.time()-t1)

This takes 34 seconds for the first version and 78 seconds for the second version on my HDD.

