Is it possible to do parallel reads on one h5py file using multiprocessing?


Question


I am trying to speed up the process of reading chunks (load them into RAM memory) out of a h5py dataset file. Right now I try to do this via the multiprocessing library.

import multiprocessing as mp

pool = mp.Pool(NUM_PROCESSES)
gen = pool.imap(loader, indices)

The loader function looks like this:

import h5py

def loader(indices):
    # Each call opens its own read-only handle, reads the selected
    # block into memory, and returns it to the parent process.
    with h5py.File("location", 'r') as dataset:
        x = dataset["name"][indices]
        return x


This actually sometimes works (meaning that the expected loading time is divided by the number of processes and thus parallelized). However, most of the time it doesn't and the loading time just stays as high as it was when loading the data sequentially. Is there anything I can do to fix this? I know h5py supports parallel read/writes through mpi4py but I would just want to know if that is absolutely necessary for only reads as well.

Answer


Parallel reads are fine with h5py, no need for the MPI version. But why do you expect a speed-up here? Your job is almost entirely I/O bound, not CPU bound. Parallel processes are not gonna help because the bottleneck is your hard disk, not the CPU. It wouldn't surprise me if parallelization in this case even slowed down the whole reading operation. Other opinions?
