How to find why a task fails in dask distributed?


Question

I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the Executor.map function sometimes fail, while other, seemingly identical tasks run successfully.

Does the framework provide any means to diagnose problems?

Update: By "failing" I mean that the counter of failed tasks in the Bokeh web UI, provided by the scheduler, increases. The counter of finished tasks increases too.

The function run by Executor.map returns None. It communicates with a database, retrieves some rows from a table, performs calculations, and updates values.

I have more than 40000 tasks in the map, so studying the logs is rather tedious.

Recommended answer

If a task fails, any attempt to retrieve its result will raise the same error that occurred on the worker:

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: def div(x, y):
   ...:     return x / y
   ...: 

In [4]: future = c.submit(div, 1, 0)

In [5]: future.result()
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero
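With tens of thousands of tasks, calling result() on each future by hand is impractical. The following is a minimal sketch, not the answer author's own method, of how the failed futures could be collected in one pass using the status, exception() and traceback() members of distributed futures. The function process_row is a hypothetical stand-in for the real database task, and the sketch assumes a recent distributed release, where Client is the current name for what the question calls Executor.

```python
import traceback

from distributed import Client, wait


def process_row(i):
    # Hypothetical stand-in for the real task: fails for some inputs,
    # otherwise returns None like the function in the question.
    if i % 7 == 0:
        raise ValueError(f"bad row {i}")
    return None


def collect_failures(futures):
    """Return {task key: (exception, formatted traceback)} for failed futures."""
    wait(futures)  # block until every future has finished or errored
    failures = {}
    for fut in futures:
        if fut.status == "error":
            tb_text = "".join(traceback.format_tb(fut.traceback()))
            failures[fut.key] = (fut.exception(), tb_text)
    return failures


if __name__ == "__main__":
    # In-process cluster, used here only so the sketch is self-contained.
    client = Client(processes=False, dashboard_address=None)
    futures = client.map(process_row, range(20))
    for key, (exc, tb_text) in collect_failures(futures).items():
        print(key, repr(exc))
        print(tb_text)
    client.close()
```

Grouping the collected exceptions by type or message is usually enough to see whether the failures share one cause, without reading the worker logs task by task.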

However, other things can go wrong. For example, your workers might not have the same software as your client, your network might not let connections through, or any of the other things that happen in real-world networks might occur. To help diagnose these, there are a few options:


  1. You can use the web interface to track the progress of your tasks and workers
  2. You can start IPython kernels in the scheduler or workers to inspect them directly
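For the software-mismatch case mentioned above, one further check that may help is to compare the package versions reported by the client, the scheduler, and the workers. The sketch below assumes a recent distributed release in which Client.get_versions is available; the helper name version_summary is hypothetical, and passing check=True asks the client to flag mismatches it detects.

```python
from distributed import Client


def version_summary(client):
    """Fetch version info for the client, scheduler and workers."""
    # check=True asks distributed to complain about mismatched versions.
    versions = client.get_versions(check=True)
    return {
        "components": sorted(versions),          # top-level entries reported
        "n_workers": len(versions["workers"]),   # one entry per worker address
        "raw": versions,
    }


if __name__ == "__main__":
    # In-process cluster, used here only so the sketch is self-contained.
    client = Client(processes=False, dashboard_address=None)
    summary = version_summary(client)
    print(summary["components"], summary["n_workers"])
    client.close()
```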
