How to speed up reading from compressed HDF5 files

Question

I have several big HDF5 files stored on an SSD (lzf compressed file size is 10–15 GB, uncompressed size would be 20–25 GB). Reading the contents of such a file into RAM for further processing takes roughly 2 minutes per file. During that time only one core is utilized (but at 100%), so I guess the decompression part running on the CPU is the bottleneck, not the IO throughput of the SSD.

At the start of my program it reads multiple files of that kind into RAM, which takes quite some time. I would like to speed up that process by utilizing more cores and eventually more RAM, until the SSD IO throughput becomes the limiting factor. The machine I'm working on has plenty of resources (20 CPU cores [+ 20 HT] and 400 GB RAM), and »wasting« RAM is no big deal as long as it is justified by saving time.

I came up with two ideas of my own:

1) Use python's multiprocessing module to read several files into RAM in parallel. This works in principle, but due to the usage of Pickle within multiprocessing (as stated here), I hit the 4 GiB serialization limit:

OverflowError('cannot serialize a bytes object larger than 4 GiB').
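
For context, here is a minimal sketch of what idea 1 looks like; the file names and the dataset label are placeholders, and load_file is a hypothetical helper. The worker returns a full NumPy array, which multiprocessing pickles to send back to the parent process, and results larger than 4 GiB trigger the OverflowError quoted above on this Python version.

from multiprocessing import Pool

import h5py

def load_file(args):
    # Worker: read one file completely into memory (decompression happens here).
    filename, label = args
    with h5py.File(filename, 'r') as h_file:
        return h_file[label][:]

if __name__ == '__main__':
    # 'file_0.h5' etc. and 'label' are placeholders for the real files/datasets.
    jobs = [('file_0.h5', 'label'), ('file_1.h5', 'label'), ('file_2.h5', 'label')]
    with Pool() as pool:
        # Each returned array is pickled to be sent back to the parent process;
        # a result larger than 4 GiB raises the OverflowError quoted above.
        arrays = pool.map(load_file, jobs)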

2) Make several processes (using a Pool from the multiprocessing module) open the same HDF5 file (using with h5py.File('foo.h5', 'r') as h_file:), read an individual chunk from it (chunk = h_file['label'][i : i + chunk_size]) and return that chunk. The gathered chunks will then be concatenated. However, this fails with an

OSError: Can't read data (data error detected by Fletcher32 checksum).

Is this due to the fact that I open the very same file within multiple processes (as suggested here)?

So my final question is: How can I read the content of the .h5 files into main memory faster? Again: »wasting« RAM in favor of saving time is permitted. The contents have to reside in main memory, so circumventing the problem by just reading lines, or fractions, is not an option. I know that I could just store the .h5 files uncompressed, but that is the last option I would like to use, since space on the SSD is scarce. I would prefer to have both: compressed files and fast reads (ideally by making better use of the available resources).

Meta information: I use python 3.5.2 and h5py 2.8.0.

While reading a file, the SSD works at a speed of 72 MB/s, far from its maximum. The .h5 files were created using h5py's create_dataset method with the compression="lzf" option.
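
For reference, such a dataset could be written roughly as below; the file name, dataset name, and data are placeholders, and fletcher32=True is only an assumption inferred from the checksum error quoted earlier.

import h5py
import numpy as np

data = np.random.rand(1000, 1000).astype('float32')  # placeholder data
with h5py.File('foo.h5', 'w') as h_file:
    # compression='lzf' matches the question; fletcher32=True is assumed here
    # because the OSError above mentions a Fletcher32 checksum.
    h_file.create_dataset('label', data=data, compression='lzf', fletcher32=True)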

EDIT 2: This is (simplified) the code I use to read the content of a (compressed) HDF5 file:

from itertools import repeat
from multiprocessing import Pool
import h5py
import numpy as np

def opener(filename, label):  # regular version
    with h5py.File(filename, 'r') as h_file:
        data = h_file[label][:]  # read (and transparently decompress) the whole dataset
    return data

def fast_opener(filename, label):  # multiple processes version
    with h5py.File(filename, 'r') as h_file:
        length = len(h_file[label])
    pool = Pool()  # multiprocessing.Pool and not multiprocessing.dummy.Pool
    args_iter = zip(
        range(0, length, 1000),
        repeat(filename),
        repeat(label),
    )
    chunks = pool.starmap(_read_chunk_at, args_iter)
    pool.close()
    pool.join()
    return np.concatenate(chunks)

def _read_chunk_at(index, filename, label):
    # Each worker opens its own file handle and reads a 1000-row slice.
    with h5py.File(filename, 'r') as h_file:
        data = h_file[label][index : index + 1000]
    return data

As you can see, the decompression is done by h5py transparently.
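
As an aside (not part of the original post), h5py exposes a dataset's filter settings, so you can confirm which filters run on every read; 'foo.h5' and 'label' are placeholders here.

import h5py

with h5py.File('foo.h5', 'r') as h_file:
    dset = h_file['label']
    print(dset.compression)   # e.g. 'lzf'
    print(dset.fletcher32)    # True if the checksum filter is enabled
    print(dset.chunks)        # storage chunk shape; each chunk is compressed independently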

Answer

h5py handles decompression of LZF files via a filter. The source code of the filter, implemented in C, is available on the h5py GitHub here. Looking at the implementation of lzf_decompress, which is the function causing your bottleneck, you can see that it is not parallelized (no idea if it's even parallelizable; I'll leave that judgement to people more familiar with LZF's inner workings).

With that said, I'm afraid there's no way to just take your huge compressed file and multithread-decompress it. Your options, as far as I can tell, are:

  • Split the huge file into smaller, individually compressed chunks, parallel-decompress each chunk on a separate core (multiprocessing might help there, but you'll need to take care with inter-process shared memory; see the sketch after this list) and join everything back together after it's decompressed.
  • Just use uncompressed files.
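
One way to read the first option, assuming the dataset already uses HDF5's chunked storage (so each stored chunk is compressed independently), is to have every worker decompress one slice range and write it into a shared-memory buffer instead of returning it, which sidesteps the 4 GiB pickle limit. The sketch below is only an illustration under those assumptions; the names parallel_read, _init_worker and _decompress_slice are hypothetical, and whether this also avoids the Fletcher32 error would need to be verified.

from multiprocessing import Pool, RawArray

import h5py
import numpy as np

_worker = {}  # per-worker globals, filled in by _init_worker

def _init_worker(raw, shape, dtype, filename, label):
    # Wrap the shared buffer as a NumPy array; no data is copied here.
    _worker['out'] = np.frombuffer(raw, dtype=dtype).reshape(shape)
    _worker['filename'] = filename
    _worker['label'] = label

def _decompress_slice(start, stop):
    # Each worker opens its own file handle and decompresses its slice
    # straight into the shared buffer, so nothing large goes through pickle.
    with h5py.File(_worker['filename'], 'r') as h_file:
        _worker['out'][start:stop] = h_file[_worker['label']][start:stop]

def parallel_read(filename, label, chunk_size=1000):
    with h5py.File(filename, 'r') as h_file:
        shape = h_file[label].shape
        dtype = h_file[label].dtype
    raw = RawArray('b', int(np.prod(shape)) * dtype.itemsize)  # shared result buffer
    bounds = [(i, min(i + chunk_size, shape[0]))
              for i in range(0, shape[0], chunk_size)]
    with Pool(initializer=_init_worker,
              initargs=(raw, shape, dtype, filename, label)) as pool:
        pool.starmap(_decompress_slice, bounds)
    return np.frombuffer(raw, dtype=dtype).reshape(shape)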
