Difference between dask.distributed LocalCluster with threads vs. processes

Question

What is the difference between the following LocalCluster configurations for dask.distributed?

Client(n_workers=4, processes=False, threads_per_worker=1)

Client(n_workers=1, processes=True, threads_per_worker=4)

They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads as opposed to a single worker with multiple threads?

Edit: just a clarification, I'm aware of the difference between processes, threads and shared memory, so this question is oriented more towards the configurational differences of these two Clients.

Answer

I was inspired by Victor's and Martin's answers to dig a little deeper, so here's an in-depth summary of my understanding (it wouldn't fit in a comment).

First, note that the scheduler printout in this version of dask isn't quite intuitive: processes is actually the number of workers, and cores is actually the total number of threads across all workers.
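
To see the actual layout behind that printout, you can inspect the scheduler state directly. A minimal sketch, assuming a dask 2.x client whose scheduler_info() reports a per-worker nthreads field (older releases called it ncores):

from dask.distributed import Client

if __name__ == '__main__':
    with Client(processes=False, n_workers=3) as client:
        workers = client.scheduler_info()['workers']
        # The repr's "processes" is really len(workers); its "cores" is
        # the sum of the per-worker thread counts.
        print('workers:', len(workers))
        print('total threads:', sum(w['nthreads'] for w in workers.values()))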

Secondly, Victor's comments about the TCP address and about adding/connecting more workers are worth pointing out. I'm not sure whether more workers can be added to a cluster created with processes=False, but I think the answer is probably yes; the sketch below is one way to try it.
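
LocalCluster has a scale() method that requests a new total number of workers. A rough sketch, assuming a distributed version that provides Client.wait_for_workers() (otherwise a short sleep would do):

from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(processes=False, n_workers=1, threads_per_worker=1)
    client = Client(cluster)
    print(client)                # 1 in-process worker
    cluster.scale(4)             # request 4 workers in total
    client.wait_for_workers(4)   # block until they have registered
    print(client)                # 4 in-process workers
    client.close()
    cluster.close()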

Now, consider the following script:

from dask.distributed import Client

if __name__ == '__main__':
    with Client(processes=False) as client:  # Config 1
        print(client)
    with Client(processes=False, n_workers=4) as client:  # Config 2
        print(client)
    with Client(processes=False, n_workers=3) as client:  # Config 3
        print(client)
    with Client(processes=True) as client:  # Config 4
        print(client)
    with Client(processes=True, n_workers=3) as client:  # Config 5
        print(client)
    with Client(processes=True, n_workers=3,
                threads_per_worker=1) as client:  # Config 6
        print(client)

This produces the following output in dask version 2.3.0 on my laptop (4 cores):

<Client: scheduler='inproc://90.147.106.86/14980/1' processes=1 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/9' processes=4 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/26' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51744' processes=4 cores=4>
<Client: scheduler='tcp://127.0.0.1:51788' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51818' processes=3 cores=3>

Here's my understanding of the differences between the configurations:

  1. The scheduler and all workers run as threads within the client process. (As Martin said, this is useful for introspection.) Because neither the number of workers nor the number of threads per worker is given, dask calls its nprocesses_nthreads() function to set the defaults (with processes=False: 1 worker, with as many threads as available cores).
  2. Same as 1, but since n_workers was given, dask chooses the threads per worker so that the total number of threads equals the number of cores (here, 1 thread per worker). Again, processes in the printout is not exactly right -- it's actually the number of workers (which in this case are really threads).
  3. Same as 2, but since n_workers doesn't divide the number of cores evenly, dask chooses 2 threads per worker, overcommitting rather than undercommitting (see the sketch after this list).
  4. The client, scheduler and all workers are separate processes. Dask chooses the default number of workers (equal to the number of cores, because it is <= 4) and the default number of threads per worker (1).
  5. Separate processes as in 4, but with the total number of threads overcommitted for the same reason as in 3.
  6. This behaves as expected: 3 worker processes with 1 thread each.
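
For reference, here's my reading of how the threads-per-worker default falls out when only n_workers is given. This is a hypothetical re-implementation for illustration, not dask's actual nprocesses_nthreads() code:

import math

def threads_per_worker(n_workers, cores):
    # Overcommit rather than undercommit when cores % n_workers != 0.
    return max(1, math.ceil(cores / n_workers))

for n in (1, 3, 4):
    print(n, 'worker(s) ->', threads_per_worker(n, 4), 'thread(s) each')

# With 4 cores this gives 4, 2 and 1 threads per worker, matching
# configs 1, 3/5 and 2 above (3 workers * 2 threads = 6 total).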
