Understanding the process of loading multiple file contents into Dask Array and how it scales

Question

Using the example from http://dask.pydata.org/en/latest/array-creation.html

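from glob import glob
import h5py
import dask.array as da

filenames = sorted(glob('2015-*-*.hdf5'))  # closing parenthesis was missing
dsets = [h5py.File(fn, 'r')['/data'] for fn in filenames]  # open each file read-only
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)  # Concatenate arrays along first axis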

I'm having trouble understanding what gets returned here: is it a dask array of "dask arrays", or a "normal" NumPy array that points to as many dask arrays as there were datasets across all the HDF5 files?

Is there any performance gain (thread- or memory-based) during the file-read stage as a result of da.from_array, or should you expect improvements only once you concatenate into the dask array x?

Recommended answer

The objects in the arrays list are all dask arrays, one for each file.

The x object is also a dask array, one that combines all of the dask arrays in the arrays list. It isn't a dask.array of dask arrays; it's just a single flattened dask array with a larger first dimension.
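A minimal sketch to check this yourself, assuming small in-memory NumPy arrays as hypothetical stand-ins for the HDF5 datasets (the shapes and chunk sizes are made up for illustration):

```python
import numpy as np
import dask.array as da

# Hypothetical stand-ins for three HDF5 '/data' datasets, each of shape (200, 100).
dsets = [np.full((200, 100), i, dtype="f8") for i in range(3)]

# One dask array per dataset, as in the question.
arrays = [da.from_array(dset, chunks=(100, 100)) for dset in dsets]

# A single dask array whose first dimension is the sum of the inputs' first dimensions.
x = da.concatenate(arrays, axis=0)

print(type(arrays[0]))  # a dask array
print(type(x))          # also a dask array, not a container of arrays
print(x.shape)          # (600, 100)
print(x.chunks)         # chunk sizes along each axis
```

The chunks of x are simply the chunks of the inputs laid end to end along the first axis, so there is no nesting to unwrap.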

There will probably not be an increase in performance for reading data. You're likely to be I/O bound by your disk bandwidth. Most people in this situation use dask.array because they have more data than can conveniently fit into RAM. If that isn't valuable to you, then I would stick with NumPy.
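To illustrate the larger-than-RAM point, here is a rough sketch, again assuming an in-memory NumPy array as a hypothetical stand-in for on-disk data. With a real HDF5 dataset the same pattern lets dask work through the data one chunk at a time instead of loading everything up front; with the stand-in used here there is no memory saving, it only shows the mechanics.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for a large on-disk dataset.
data = np.arange(600 * 100, dtype="f8").reshape(600, 100)
x = da.from_array(data, chunks=(100, 100))

# These lines only build a task graph; nothing is computed yet.
total = x.sum()
col_means = x.mean(axis=0)
print(type(total))  # still a dask array

# compute() runs the graph chunk by chunk and returns NumPy results.
total_value = total.compute()
col_means_value = col_means.compute()
print(total_value)
```

If the whole dataset already fits comfortably in memory, the equivalent NumPy calls (data.sum(), data.mean(axis=0)) give the same answers with less overhead.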
