How to speed up reading from compressed HDF5 files

Question

I have several big HDF5 files stored on an SSD (lzf compressed file size is 10–15 GB, uncompressed size would be 20–25 GB). Reading the contents of such a file into RAM for further processing takes roughly 2 minutes per file. During that time only one core is utilized (but at 100%), so I guess the decompression part running on the CPU is the bottleneck, not the IO throughput of the SSD.

At the start of my program it reads multiple files of that kind into RAM, which takes quite some time. I would like to speed up that process by utilizing more cores and eventually more RAM, until the SSD IO throughput becomes the limiting factor. The machine I'm working on has plenty of resources (20 CPU cores [+ 20 HT] and 400 GB RAM), and »wasting« RAM is no big deal as long as it is justified by saving time.

I came up with two ideas of my own:

1) Use python's multiprocessing module to read several files into RAM in parallel. This works in principle, but due to the usage of Pickle within multiprocessing (as stated here), I hit the 4 GiB serialization limit:

OverflowError('cannot serialize a bytes object larger than 4 GiB').
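
For context, here is a minimal sketch of what idea 1 looks like; the file names and the dataset label are placeholders, and load_file is a hypothetical helper. The worker returns a full NumPy array, which multiprocessing pickles to send back to the parent process, and results larger than 4 GiB trigger the OverflowError quoted above on this Python version.

from multiprocessing import Pool

import h5py

def load_file(args):
    # Worker: read one file completely into memory (decompression happens here).
    filename, label = args
    with h5py.File(filename, 'r') as h_file:
        return h_file[label][:]

if __name__ == '__main__':
    # 'file_0.h5' etc. and 'label' are placeholders for the real files/datasets.
    jobs = [('file_0.h5', 'label'), ('file_1.h5', 'label'), ('file_2.h5', 'label')]
    with Pool() as pool:
        # Each returned array is pickled to be sent back to the parent process;
        # a result larger than 4 GiB raises the OverflowError quoted above.
        arrays = pool.map(load_file, jobs)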

2) Make several processes (using a Pool from the multiprocessing module) open the same HDF5 file (using with h5py.File('foo.h5', 'r') as h_file:), read an individual chunk from it (chunk = h_file['label'][i : i + chunk_size]) and return that chunk. The gathered chunks will then be concatenated. However, this fails with an

OSError: Can't read data (data error detected by Fletcher32 checksum).

Is this due to the fact that I open the very same file within multiple processes (as suggested here)?

So my final question is: How can I read the content of the .h5 files into main memory faster? Again: »wasting« RAM in favor of saving time is permitted. The contents have to reside in main memory, so circumventing the problem by just reading lines, or fractions, is not an option. I know that I could just store the .h5 files uncompressed, but that is the last option I would like to use, since space on the SSD is scarce. I would prefer to have both: compressed files and fast reads (ideally by making better use of the available resources).

Meta information: I use python 3.5.2 and h5py 2.8.0.

While reading a file, the SSD works at a speed of 72 MB/s, far from its maximum. The .h5 files were created using h5py's create_dataset method with the compression="lzf" option.
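
For reference, such a dataset could be written roughly as below; the file name, dataset name, and data are placeholders, and fletcher32=True is only an assumption inferred from the checksum error quoted earlier.

import h5py
import numpy as np

data = np.random.rand(1000, 1000).astype('float32')  # placeholder data
with h5py.File('foo.h5', 'w') as h_file:
    # compression='lzf' matches the question; fletcher32=True is assumed here
    # because the OSError above mentions a Fletcher32 checksum.
    h_file.create_dataset('label', data=data, compression='lzf', fletcher32=True)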

EDIT 2: This is (simplified) the code I use to read the content of a (compressed) HDF5 file:

from itertools import repeat
from multiprocessing import Pool
import h5py
import numpy as np

def opener(filename, label):  # regular version
    with h5py.File(filename, 'r') as h_file:
        data = h_file[label][:]  # read (and transparently decompress) the whole dataset
    return data

def fast_opener(filename, label):  # multiple processes version
    with h5py.File(filename, 'r') as h_file:
        length = len(h_file[label])
    pool = Pool()  # multiprocessing.Pool and not multiprocessing.dummy.Pool
    args_iter = zip(
        range(0, length, 1000),
        repeat(filename),
        repeat(label),
    )
    chunks = pool.starmap(_read_chunk_at, args_iter)
    pool.close()
    pool.join()
    return np.concatenate(chunks)

def _read_chunk_at(index, filename, label):
    # Each worker opens its own file handle and reads a 1000-row slice.
    with h5py.File(filename, 'r') as h_file:
        data = h_file[label][index : index + 1000]
    return data

As you can see, the decompression is done by h5py transparently.
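
As an aside (not part of the original post), h5py exposes a dataset's filter settings, so you can confirm which filters run on every read; 'foo.h5' and 'label' are placeholders here.

import h5py

with h5py.File('foo.h5', 'r') as h_file:
    dset = h_file['label']
    print(dset.compression)   # e.g. 'lzf'
    print(dset.fletcher32)    # True if the checksum filter is enabled
    print(dset.chunks)        # storage chunk shape; each chunk is compressed independently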

Answer

h5py handles decompression of LZF files via a filter. The source code of the filter, implemented in C, is available on the h5py GitHub here. Looking at the implementation of lzf_decompress, which is the function causing your bottleneck, you can see that it is not parallelized (no idea if it's even parallelizable; I'll leave that judgement to people more familiar with LZF's inner workings).

With that said, I'm afraid there's no way to just take your huge compressed file and multithread-decompress it. Your options, as far as I can tell, are:

  • Split the huge file into smaller, individually compressed chunks, parallel-decompress each chunk on a separate core (multiprocessing might help there, but you'll need to take care with inter-process shared memory; see the sketch after this list) and join everything back together after it's decompressed.
  • Just use uncompressed files.
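
One way to read the first option, assuming the dataset already uses HDF5's chunked storage (so each stored chunk is compressed independently), is to have every worker decompress one slice range and write it into a shared-memory buffer instead of returning it, which sidesteps the 4 GiB pickle limit. The sketch below is only an illustration under those assumptions; the names parallel_read, _init_worker and _decompress_slice are hypothetical, and whether this also avoids the Fletcher32 error would need to be verified.

from multiprocessing import Pool, RawArray

import h5py
import numpy as np

_worker = {}  # per-worker globals, filled in by _init_worker

def _init_worker(raw, shape, dtype, filename, label):
    # Wrap the shared buffer as a NumPy array; no data is copied here.
    _worker['out'] = np.frombuffer(raw, dtype=dtype).reshape(shape)
    _worker['filename'] = filename
    _worker['label'] = label

def _decompress_slice(start, stop):
    # Each worker opens its own file handle and decompresses its slice
    # straight into the shared buffer, so nothing large goes through pickle.
    with h5py.File(_worker['filename'], 'r') as h_file:
        _worker['out'][start:stop] = h_file[_worker['label']][start:stop]

def parallel_read(filename, label, chunk_size=1000):
    with h5py.File(filename, 'r') as h_file:
        shape = h_file[label].shape
        dtype = h_file[label].dtype
    raw = RawArray('b', int(np.prod(shape)) * dtype.itemsize)  # shared result buffer
    bounds = [(i, min(i + chunk_size, shape[0]))
              for i in range(0, shape[0], chunk_size)]
    with Pool(initializer=_init_worker,
              initargs=(raw, shape, dtype, filename, label)) as pool:
        pool.starmap(_decompress_slice, bounds)
    return np.frombuffer(raw, dtype=dtype).reshape(shape)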
