Repeated task execution using the distributed Dask scheduler

Question

I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute().

When the number of tasks is, say, 20 (much larger than the number of workers) and each task takes at least 15 seconds, the scheduler starts rerunning some of the tasks (or executes them in parallel more than once).

This is a problem since the tasks modify a SQL db and if they run again they end up raising an Exception (due to DB uniqueness constraints). I'm not setting pure=True anywhere (and I believe the default is False). Other than that, the Dask graph is trivial (no dependencies between the tasks).

I'm still not sure whether this is a feature or a bug in Dask. I have a gut feeling that it might be related to work stealing...

Recommended answer

Correct. If a task is allocated to one worker and another worker becomes free, the free worker may steal excess tasks from its peers. There is a chance that it will steal a task that has just started to run, in which case the task will run twice.

The clean way to handle this problem is to ensure that your tasks are idempotent, that is, that they return the same result even if run twice. This might mean handling your database error within your task.
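As a rough illustration of handling the database error inside the task, here is a minimal sketch. It assumes SQLite (via the standard-library `sqlite3` module) in place of whatever SQL database the question uses, and the function name `record_result`, the table `results`, and its columns are all hypothetical. The uniqueness violation is caught inside the task, so a second run becomes a no-op instead of raising.

```python
import sqlite3


def record_result(db_path, task_id, value):
    """Write one row per task_id; return True if inserted, False if it already existed.

    Hypothetical task body: the table and column names are assumptions
    made for this sketch, not taken from the original question.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on normal exit, rolls back on an uncaught exception
            conn.execute(
                "CREATE TABLE IF NOT EXISTS results "
                "(task_id INTEGER PRIMARY KEY, value TEXT)"
            )
            try:
                conn.execute(
                    "INSERT INTO results (task_id, value) VALUES (?, ?)",
                    (task_id, value),
                )
                return True
            except sqlite3.IntegrityError:
                # The row is already there, so an earlier (or concurrent) run
                # of this task did the work. Treat the rerun as a success.
                return False
    finally:
        conn.close()
```

Wrapped with `dask.delayed`, a function like this could be executed twice by the scheduler without the second execution failing. With another database, the same idea would apply using that driver's own integrity-error exception or an "insert if not exists" statement.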

This is one of those policies that are great for data intensive computing workloads but terrible for data engineering workloads. It's tricky to design a system that satisfies both needs simultaneously.
