Dask For Loop In Parallel

Problem description

I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fit my case, which is extremely basic.

First, is this the correct way to run a for-loop in parallel?

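%%time
from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    # Pure-Python busy loop standing in for real work
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1


# Calling a delayed function only records a task; nothing runs yet
for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)


# Sum the lazy results, then execute the whole task graph
total = delayed(sum)(keep_return)
total.compute()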

This produced:

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s

If I run this in serial,

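%%time

list_names = ['a', 'b', 'c', 'd']
keep_return = []


def loop_dummy(target):
    # Same busy loop, without the @delayed decorator
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1


# Each call runs immediately, one after another
for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)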

it is actually faster.

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s

I have seen examples stating that Dask adds a small amount of overhead, but this job seems to run long enough to justify that overhead, no?

My actual for loop involves heavier computation where I build a model for various targets.

Solution

This computation

for i in range(...):
    pass

is bound by the GIL (Python's global interpreter lock), so threads cannot run it in parallel. You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:

total.compute(scheduler='multiprocessing')
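As a rough end-to-end sketch of that recommendation, the question's loop can be wrapped in a small script. The helper name `count_targets` and the `n` parameter are hypothetical, the loop count is scaled down so the script finishes quickly, and `"processes"` is used here as the scheduler name on the assumption that it selects the same process-based scheduler.

```python
from dask import delayed


@delayed
def loop_dummy(target, n=1_000_000):
    # Pure-Python busy loop: it holds the GIL for its whole run,
    # so threads cannot overlap several of these calls.
    # n is a hypothetical knob, scaled down from the question's 1e9.
    for _ in range(n):
        pass
    return 1


def count_targets(names, scheduler="processes"):
    """Build one delayed task per name, then sum the results."""
    tasks = [loop_dummy(name) for name in names]
    total = delayed(sum)(tasks)
    # The scheduler is chosen per compute() call.
    return total.compute(scheduler=scheduler)


if __name__ == "__main__":
    # Guarded because a process-based scheduler may start fresh
    # interpreters that import this script again.
    print(count_targets(["a", "b", "c", "d"]))
```

Passing `scheduler="synchronous"` to the same helper runs everything in one thread, which is handy for debugging before switching back to processes.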

However, if your actual computation is mostly NumPy/Pandas/Scikit-Learn/other numeric package code, then the default threading backend is probably the right choice, because much of that code releases the GIL while it runs.
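For contrast, here is a minimal sketch of the numeric case on the threaded scheduler. The task `gram_norm` and the helper `run_threaded` are made-up stand-ins for "build a model per target"; the only assumption about NumPy is that large matrix products spend their time in compiled code that can release the GIL.

```python
import numpy as np
from dask import compute, delayed


@delayed
def gram_norm(seed, size=300):
    # Hypothetical numeric task: most of the time is spent inside
    # NumPy's compiled matrix product rather than in Python bytecode.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((size, size))
    return float(np.linalg.norm(a @ a.T))


def run_threaded(seeds):
    """Evaluate one task per seed using the threaded scheduler."""
    tasks = [gram_norm(seed) for seed in seeds]
    # compute() returns a tuple with one result per task.
    return list(compute(*tasks, scheduler="threads"))


print(run_threaded(range(4)))
```

Threads avoid the cost of pickling arguments and results between processes, which is why they tend to suit this kind of workload.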

More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html
