Avoid simultaneously reading multiple files for a dask array


Question

From a library, I get a function that reads a file and returns a numpy array.

I want to build a Dask array with multiple blocks from multiple files.

Each block is the result of calling the function on a file.

When I ask Dask to compute, will Dask ask these functions to read multiple files from the hard disk at the same time?

If that is the case, how can I avoid it? My computer doesn't have a parallel file system.

Example:

import numpy as np
import dask.array as da
import dask

# Make test data
n = 2
m = 3
x = np.arange(n * m, dtype=np.int64).reshape(n, m)
np.save('0.npy', x)
np.save('1.npy', x)

# np.load is a function that reads a file 
# and returns a numpy array.

# Build delayed
y = [dask.delayed(np.load)('%d.npy' % i)
     for i in range(2)]

# Build individual Dask arrays.
# I can get the shape of each numpy array without 
# reading the whole file.
z = [da.from_delayed(a, (n, m), np.int64) for a in y]

# Combine the dask arrays
w = da.vstack(z)

print(w.compute())


Answer

You could use a distributed Lock primitive, so that your loader function does acquire-read-release.

import distributed

read_lock = distributed.Lock('numpy-read')

@dask.delayed
def load_numpy(lock, fn):
    # Serialize disk access: only one task reads at a time.
    lock.acquire()
    try:
        out = np.load(fn)
    finally:
        lock.release()
    return out

y = [load_numpy(read_lock, '%d.npy' % i) for i in range(2)]

Also, da.from_array accepts a lock, so you could create the individual arrays directly, supplying the lock to serialize the reads.
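A minimal sketch of that idea, assuming the file can be opened lazily with a memory-mapped np.load (so the actual disk reads are deferred until compute time); a single threading.Lock shared by every chunk-read task keeps the reads from overlapping:

```python
import threading

import numpy as np
import dask.array as da

# Make a test file.
np.save('data.npy', np.arange(6).reshape(2, 3))

# mmap_mode defers the real disk reads until chunks are accessed.
arr = np.load('data.npy', mmap_mode='r')

# One lock shared by all chunk-read tasks, so reads never overlap.
lock = threading.Lock()
d = da.from_array(arr, chunks=(1, 3), lock=lock)

print(d.sum().compute())
```

Passing lock=True instead would let dask create the shared lock for you.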

Alternatively, you could assign a single unit of an abstract resource to the worker (which has multiple threads), and then compute (or persist) with a requirement of one unit per file-read task, as in the example in the linked documentation.
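A sketch of that approach, assuming the distributed scheduler is available; the resource name 'IO', the in-process LocalCluster, and the stand-in load function are illustrative choices, not part of the original answer:

```python
import dask
from dask.distributed import Client, LocalCluster

# One worker with several threads, carrying a single unit of an
# abstract 'IO' resource.
cluster = LocalCluster(n_workers=1, threads_per_worker=4,
                       processes=False, resources={'IO': 1})
client = Client(cluster)

@dask.delayed
def load(i):
    return i  # stand-in for a real file-reading function like np.load

tasks = [load(i) for i in range(4)]

# Each task claims one 'IO' unit, so at most one read runs at a time,
# while other (non-read) tasks could still use the remaining threads.
results = dask.compute(*tasks, resources={'IO': 1})
print(results)

client.close()
cluster.close()
```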

Response to comment: to_hdf wasn't specified in the question, so I am not sure why it is being raised now; however, you can use da.store(compute=False) with an h5py.File, and then specify the resource to use when calling compute. Note that this does not materialise the data into memory.
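A small sketch of the da.store(compute=False) pattern; a plain numpy array stands in here for the h5py dataset, since any target supporting slice assignment works:

```python
import numpy as np
import dask.array as da

x = da.arange(6, chunks=3)

# In the answer above the target would be a dataset in an h5py.File;
# a numpy array is used here only to keep the sketch self-contained.
target = np.zeros(6, dtype=x.dtype)

# compute=False returns a delayed task graph instead of writing now.
stored = da.store(x, target, compute=False)

# The write happens here; chunks stream into the target one by one,
# so the full result is never held in memory at once.
stored.compute()
print(target)
```

With a distributed client, the resources= requirement could be passed to that final compute call.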

