Optimal HDF5 dataset chunk shape for reading rows


Problem description

I have a reasonably sized (18 GB compressed) HDF5 dataset and am looking to optimize reading rows for speed. The shape is (639038, 10000). I will be reading a selection of rows (say ~1000 rows) many times, located across the dataset, so I can't use x:(x+1000) to slice rows.

Reading rows from out-of-memory HDF5 is already slow using h5py since I have to pass a sorted list and resort to fancy indexing. Is there a way to avoid fancy indexing, or is there a better chunk shape/size I can use?

I have read rules of thumb such as 1 MB-10 MB chunk sizes and choosing a chunk shape consistent with what I'm reading. However, building a large number of HDF5 files with different chunk shapes for testing is computationally expensive and very slow.
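
The raw (uncompressed) size of a chunk can be checked against that rule of thumb directly from its shape and dtype, without building any test files. A minimal sketch (the candidate shapes below are only examples, including the (64, 1000) chunks of the current dataset and the (128, 10000) shape mentioned further down):

import numpy as np

def chunk_nbytes(chunk_shape, dtype=np.float32):
    # uncompressed size of one chunk in bytes
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

for candidate in [(64, 1000), (128, 10000), (1000, 1000)]:
    print(candidate, round(chunk_nbytes(candidate) / 1024**2, 2), "MiB")
# (64, 1000)   -> ~0.24 MiB
# (128, 10000) -> ~4.88 MiB (the "~5 MB" mentioned below)
# (1000, 1000) -> ~3.81 MiB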

For each selection of ~1,000 rows, I immediately sum them to get an array of length 10,000. My current dataset looks like this:

'10000': {'chunks': (64, 1000),
          'compression': 'lzf',
          'compression_opts': None,
          'dtype': dtype('float32'),
          'fillvalue': 0.0,
          'maxshape': (None, 10000),
          'shape': (639038, 10000),
          'shuffle': False,
          'size': 2095412704}
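
For reference, a minimal sketch of the read pattern described above (a sorted, duplicate-free index list for h5py's fancy indexing, followed by the sum); the file name is a placeholder and the dataset key '10000' is taken from the summary above:

import h5py
import numpy as np

with h5py.File("data.h5", "r") as f:      # placeholder file name
    dset = f["10000"]                     # dataset key as shown in the summary above
    # h5py fancy indexing needs indices in increasing order without duplicates
    rows = np.sort(np.random.choice(dset.shape[0], size=1000, replace=False))
    summed = dset[rows, :].sum(axis=0)    # array of length 10,000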

What I have already tried:

  • Rewriting the dataset with chunk shape (128, 10000), which I calculate to be ~5 MB, was very slow.
  • I looked at dask.array for optimization, but since ~1,000 rows easily fit in memory, I don't see any benefit.

Answer

Finding the right chunk-cache size

First I want to discuss some general things. It is very important to know that each individual chunk can only be read or written as a whole. The default chunk cache of h5py, which helps to avoid excessive disk I/O, is only 1 MB and should in many cases be increased; this is discussed later on.
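
In recent h5py versions (2.9+) the chunk cache can be configured per file when it is opened; a minimal sketch with illustrative values:

import h5py

f = h5py.File(
    "data.h5", "r",             # placeholder file name
    rdcc_nbytes=4 * 1024**3,    # chunk cache size in bytes (default is 1 MiB)
    rdcc_nslots=10_000_000,     # hash-table slots; should be much larger than the number of cached chunks
    rdcc_w0=0.75,               # eviction preference for fully read/written chunks (default 0.75)
)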

Let's take an example:

  • We have a dset with shape (639038, 10000), float32 (25.5 GB uncompressed)
  • We want to write the data column-wise, dset[:,i]=arr, and read it row-wise, arr=dset[i,:]
  • We chose a completely wrong chunk shape for this kind of work, namely (1, 10000)

In this case the reading speed won't be too bad (although the chunk size is a little small), because we only read the data we are using. But what happens when we write to that dataset? If we access a column, one floating-point number of each chunk is written. This means we are actually writing the whole dataset (25.5 GB) with every iteration and reading the whole dataset every other time. This is because if you modify a chunk, you have to read it first if it is not cached (I assume a chunk-cache size below 25.5 GB here).
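
A back-of-the-envelope sketch of that write amplification for the (1, 10000) chunk shape (assuming the cache cannot hold the touched chunks):

import numpy as np

rows, cols = 639038, 10000
itemsize = np.dtype(np.float32).itemsize    # 4 bytes

chunk_bytes = 1 * cols * itemsize           # 40 kB per (1, 10000) chunk
chunks_touched_per_column = rows            # one float written into every chunk

bytes_rewritten = chunks_touched_per_column * chunk_bytes
print(bytes_rewritten / 1e9, "GB rewritten per column")   # ~25.6 GB, plus roughly the same in reads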

So what can we improve here? In such a case we have to make a compromise between write/read speed and the memory which is used by the chunk-cache.

An assumption which will give decent read and write speed:

  • We choose a chunk size of (100, 1000)
  • With this layout, writing whole columns (dset[:,i]) needs at least 1000*639038*4 bytes → 2.55 GB of cache to avoid the extra I/O overhead described above, while reading whole rows (dset[i,:]) needs 100*10000*4 bytes → 4 MB (see the worked numbers below)
  • So in this example we should provide at least 2.6 GB of chunk-data cache.
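
The numbers in the list above can be reproduced like this (a sketch using the example shapes):

import numpy as np

rows, cols = 639038, 10000
chunk = (100, 1000)
itemsize = np.dtype(np.float32).itemsize

chunk_bytes = chunk[0] * chunk[1] * itemsize                     # 0.4 MB per chunk

# column-wise access touches every chunk along the first axis
column_cache = -(-rows // chunk[0]) * chunk_bytes                # ceil division
print(column_cache / 1e9, "GB needed for column-wise access")    # ~2.56 GB

# row-wise access touches every chunk along the second axis
row_cache = (cols // chunk[1]) * chunk_bytes
print(row_cache / 1e6, "MB needed for row-wise access")          # ~4 MB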

Conclusion: There is no generally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without thinking about the chunk cache. For random reads and writes, RAM is orders of magnitude faster than the fastest SSD.

Regarding your problem: I would simply read the random rows; the improper chunk-cache size is your real problem.

Compare the performance of the following code with your version:

import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'

    # shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # We are using 4 GB of chunk cache memory here ("rdcc_nbytes")
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=1e7)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(0, shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Reading random rows
    # If we read one row, a whole chunk of 100 rows is actually read, but if we
    # access a row that is already in the cache we see a huge speed-up.
    f = h5.File(File_Name_HDF5, 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=1e7)
    d = f["Test"]
    for j in range(0, 639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        inds = np.random.randint(0, high=shape[0]-1, size=1000)
        for i in range(0, inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time() - t1)
    f.close()

ReadingAndWriting()

Simplest form of fancy slicing

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:

def Writing():
    File_Name_HDF5 = 'Test.h5'

    # shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # Writing_1 normal indexing
    ###########################################
    # Same 4 GB chunk cache as above (the original used an h5py_cache wrapper with
    # chunk_cache_mem_size here; recent h5py exposes the cache directly via rdcc_nbytes)
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=1e7)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Writing_2 simplest form of fancy indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=1e7)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array

    f.close()
    print(time.time() - t1)

Writing()

On my HDD this gives 34 seconds for the first version and 78 seconds for the second version.
