Reading multiple files with Dask

Problem description

I'm trying out Dask on a simple, embarrassingly parallel read of 24 scientific data files, each ~250MB, so ~6GB in total. The data is in a 2D array format. It's stored on a parallel file system and read from a cluster, though I'm reading from only a single node right now. The data is in a format similar to HDF5 (called Adios), and is read in much the same way as with the h5py package. Each file takes about 4 seconds to read. I'm following the skimage example here (http://docs.dask.org/en/latest/array-creation.html). However, I never get a speedup, no matter how many workers I use. I thought perhaps I was using it wrong and still running on only 1 worker, but when I profile it, there do appear to be 24 workers. How can I get a speedup when reading this data?

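import adios as ad
import numpy as np
import dask
import dask.array as da

# `paths` is the list of the 24 data file paths (not shown in the question).
# Wrap the reader so that each file is loaded lazily as one task.
bpread = dask.delayed(lambda f: ad.file(f)['data'][...], pure=True)
lazy_datas = [bpread(path) for path in paths]

# Load one file eagerly to get the shape and dtype used for every array.
sample = lazy_datas[0].compute()

# Build one dask array per file and stack them along a new first axis.
arrays = [da.from_delayed(lazy_data, dtype=sample.dtype, shape=sample.shape) for lazy_data in lazy_datas]
datas = da.stack(arrays, axis=0)

# Read everything using 24 worker processes.
datas2 = datas.compute(scheduler='processes', num_workers=24)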

Recommended answer

I recommend looking at the /profile tab of the scheduler's dashboard. This will tell you what lines of code are taking up the most time.
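The dashboard is served by the dask.distributed scheduler, so the computation has to run through a distributed Client rather than with scheduler='processes'. The following is a minimal sketch of that setup, not the asker's actual code: `build_stack`, `fake_read`, and the placeholder file names are hypothetical stand-ins, and `fake_read` would be replaced by the real Adios reader. It assumes the `distributed` package is installed (the dashboard pages additionally need `bokeh`).

```python
import dask
import dask.array as da
import numpy as np
from dask.distributed import Client


def build_stack(read_one, paths):
    """Stack one lazily loaded 2D array per path along a new first axis."""
    lazy = [dask.delayed(read_one)(p) for p in paths]
    sample = read_one(paths[0])  # eager read, only to learn shape and dtype
    arrays = [
        da.from_delayed(x, shape=sample.shape, dtype=sample.dtype) for x in lazy
    ]
    return da.stack(arrays, axis=0)


def fake_read(path):
    # Hypothetical stand-in for the real Adios reader in the question.
    return np.full((4, 5), float(len(path)))


if __name__ == "__main__":
    # Creating a Client starts a local cluster and makes it the default
    # scheduler for subsequent .compute() calls.
    client = Client(n_workers=4, threads_per_worker=1)
    print("dashboard:", client.dashboard_link)  # open this URL, then the Profile tab

    paths = [f"file_{i:02d}.bp" for i in range(24)]  # placeholder names
    result = build_stack(fake_read, paths).compute()
    print(result.shape)
    client.close()
```

With the real reader plugged in, the Profile tab should show whether the time is spent inside the file-reading call or elsewhere, for example in moving results between processes.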

My first guess is that you are already maxing out your disk's ability to serve data to you. You aren't CPU bound, so adding more cores won't help. That's just a guess, though. As always, you'll have to profile and investigate your situation further to know for sure.
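One rough way to test the disk-bound guess without Dask is to read the same files with an increasing number of threads and watch whether aggregate throughput rises. The sketch below uses only the standard library; `read_file`, `timed_read`, and `make_test_files` are illustrative names, and the small temporary files are stand-ins for the real data. Note that freshly written files usually sit in the operating system's page cache, so a meaningful measurement would point `paths` at the real files and read each of them only once per run. It also measures raw file reads only, not any decoding the Adios reader does.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read a whole file and return the number of bytes read."""
    with open(path, "rb") as f:
        return len(f.read())


def timed_read(paths, workers):
    """Read all paths with `workers` threads; return (seconds, total_bytes)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_file, paths))
    return time.perf_counter() - start, total


def make_test_files(directory, count, size):
    """Create `count` files of `size` random bytes (stand-ins for real data)."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"part_{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_test_files(tmp, count=8, size=1 << 20)
        for workers in (1, 2, 4, 8):
            seconds, total = timed_read(paths, workers)
            print(f"{workers} workers: {total / seconds / 1e6:.0f} MB/s")
```

If throughput stays roughly flat as the worker count grows, the storage system is the likely limit, and more Dask workers on the same node are unlikely to help.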
