Dask For Loop In Parallel
I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions but none fit my condition, which is extremely basic.
First, is this the correct way to run a for-loop in parallel?
%%time
from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1

for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)
total = delayed(sum)(keep_return)
total.compute()
This produced
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s
If I run this in serial,
%%time
list_names = ['a', 'b', 'c', 'd']
keep_return = []

def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1

for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)
it is actually faster.
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s
I have seen it stated that Dask adds a small amount of overhead, but here the parallel version is no faster than the serial one. That seems like more than a small amount of overhead, no?
My actual for loop involves heavier computation where I build a model for various targets.
This computation

for i in range(...):
    pass

is bound by the GIL. You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:
total.compute(scheduler='multiprocessing')
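Putting the recommendation together, here is a minimal runnable sketch of the pattern (newer Dask versions spell this scheduler 'processes'; the loop size is shortened from the question's 10**9 purely for brevity):

```python
from dask import delayed

@delayed
def loop_dummy(target):
    # A pure-Python loop holds the GIL the whole time, so the threaded
    # scheduler cannot overlap these tasks; separate processes can.
    for _ in range(1_000_000):
        pass
    return 1

keep_return = [loop_dummy(name) for name in ['a', 'b', 'c', 'd']]
total = delayed(sum)(keep_return)

# Run the four tasks in worker processes, sidestepping the GIL.
result = total.compute(scheduler='processes')
print(result)  # 4
```

Note that the process scheduler must serialize the function and its inputs to the workers, so it only pays off when each task does substantially more work than that serialization cost.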
However, if your actual computation is mostly NumPy/Pandas/scikit-learn/other numeric package code, then the default threading backend is probably the right choice.
More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html