Difference between dask.distributed LocalCluster with threads vs. processes
Question
What is the difference between the following LocalCluster configurations for dask.distributed?
Client(n_workers=4, processes=False, threads_per_worker=1)
vs.
Client(n_workers=1, processes=True, threads_per_worker=4)
They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads as opposed to a single worker with multiple threads?
Edit: just a clarification, I'm aware of the difference between processes, threads, and shared memory, so this question is oriented more towards the configurational differences between these two Clients.
Answer
I was inspired by both Victor's and Martin's answers to dig a little deeper, so here's an in-depth summary of my understanding (it wouldn't fit in a comment).
First, note that the scheduler printout in this version of dask isn't quite intuitive: processes is actually the number of workers, and cores is actually the total number of threads across all workers.
Secondly, Victor's comments about the TCP address and adding/connecting more workers are worth pointing out. I'm not sure whether more workers can be added to a cluster created with processes=False, but I think the answer is probably yes.
Now, consider the following script:
from dask.distributed import Client

if __name__ == '__main__':
    with Client(processes=False) as client:  # Config 1
        print(client)
    with Client(processes=False, n_workers=4) as client:  # Config 2
        print(client)
    with Client(processes=False, n_workers=3) as client:  # Config 3
        print(client)
    with Client(processes=True) as client:  # Config 4
        print(client)
    with Client(processes=True, n_workers=3) as client:  # Config 5
        print(client)
    with Client(processes=True, n_workers=3,
                threads_per_worker=1) as client:  # Config 6
        print(client)
This produces the following output in dask version 2.3.0 on my laptop (4 cores):
<Client: scheduler='inproc://90.147.106.86/14980/1' processes=1 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/9' processes=4 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/26' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51744' processes=4 cores=4>
<Client: scheduler='tcp://127.0.0.1:51788' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51818' processes=3 cores=3>
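The scheduler addresses above already hint at the underlying difference: with processes=False everything runs inside the client process (inproc), while processes=True gives each worker its own OS process. The same split can be sketched with the standard library's executors (nothing dask-specific here), by checking which PID each task runs under:

```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def worker_pid(_):
    # Report which OS process executed the task.
    return os.getpid()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        thread_pids = set(pool.map(worker_pid, range(8)))

    with ProcessPoolExecutor(max_workers=4) as pool:
        proc_pids = set(pool.map(worker_pid, range(8)))

    # Threads all share the client process (and its memory)...
    print(thread_pids == {os.getpid()})  # True
    # ...while process workers live in separate interpreters.
    print(os.getpid() in proc_pids)      # False
```

This is the same trade-off dask's LocalCluster exposes: in-process workers share memory with the client, separate-process workers talk to it over TCP.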
Here's my understanding of the differences between the configurations:
1. The scheduler and all workers run as threads within the Client process. (As Martin said, this is useful for introspection.) Because neither the number of workers nor the number of threads per worker is given, dask calls its function nprocesses_nthreads() to set the defaults: with processes=False, 1 process and threads equal to the available cores.
2. Same as 1, but since n_workers was given, dask chooses the threads per worker so that the total number of threads equals the number of cores (i.e., 1 each). Again, processes in the printout is not exactly correct -- it's actually the number of workers (which in this case are really threads).
3. Same as 2, but since n_workers doesn't divide evenly into the number of cores, dask chooses 2 threads per worker, overcommitting rather than undercommitting.
4. The Client, scheduler, and all workers are separate processes. Dask chooses the default number of workers (equal to the cores, because it's <= 4) and the default number of threads per worker (1).
5. The same worker/thread configuration as 3 (3 workers, 2 threads each), but now as separate processes; the total threads are overprescribed for the same reason as in 3.
6. This behaves as expected: 3 workers with 1 thread each.
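The pattern in configs 2, 3, and 5 is that dask picks enough threads per worker to cover all cores, rounding up. A simplified, hypothetical reimplementation of that rule (the real logic lives in dask's nprocesses_nthreads() and related code; the function name here is just illustrative):

```python
import math

def default_threads_per_worker(n_workers, cores):
    # Choose threads per worker so that total threads >= cores:
    # overcommit when n_workers doesn't divide cores, never undercommit.
    return math.ceil(cores / n_workers)

# On a 4-core machine:
print(default_threads_per_worker(4, 4))  # 1 -> 4 workers x 1 thread  (configs 2 and 4)
print(default_threads_per_worker(3, 4))  # 2 -> 3 workers x 2 threads (configs 3 and 5)
```

This reproduces the processes/cores pairs printed above: 4 workers give 4 total threads, while 3 workers give 6.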