Best practices in setting number of dask workers


Problem description

I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.

The terms I came across are: thread, process, processor, node, worker, scheduler.

My question is how to set the number of each, and whether there is a strict or recommended relationship between any of them. For example:

  • 1 worker per node, with n processes for the n cores on that node?
  • Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client.

Any other thoughts?

Answer

By "node" people typically mean a physical or virtual machine. That node can run several programs or processes at once (much like how my computer can run a web browser and text editor at once). Each process can parallelize within itself with many threads. Processes have isolated memory environments, meaning that sharing data within a process is free, while sharing data between processes is expensive.

Typically things work best on larger nodes (like 36 cores) if you cut them up into a few processes, each of which has several threads. You want the number of processes times the number of threads to equal the number of cores. So for example you might do something like the following for a 36-core machine:

  • Four processes with nine threads each
  • Twelve processes with three threads each
  • One process with thirty-six threads
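The "processes × threads = cores" rule above can be sketched with a small helper that enumerates the valid splits for a machine (a hypothetical utility for illustration only, not part of dask's API):

```python
def worker_splits(cores):
    """Return all (processes, threads) pairs whose product equals `cores`.

    Hypothetical helper for illustration; not part of dask's API.
    """
    return [(p, cores // p) for p in range(1, cores + 1) if cores % p == 0]

# For a 36-core machine, the three bullet points above all appear as valid splits:
splits = worker_splits(36)
print((4, 9) in splits)    # four processes, nine threads each
print((12, 3) in splits)   # twelve processes, three threads each
print((1, 36) in splits)   # one process with thirty-six threads
```

Any pair in this list keeps every core busy; which one you pick depends on the workload, as discussed next.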

Typically one decides between these choices based on the workload. The difference here is due to Python's Global Interpreter Lock, which limits parallelism for some kinds of data. If you are working mostly with Numpy, Pandas, Scikit-Learn, or other numerical programming libraries in Python then you don't need to worry about the GIL, and you probably want to prefer few processes with many threads each. This helps because it allows data to move freely between your cores because it all lives in the same process. However, if you're doing mostly pure Python programming, like dealing with text data, dictionaries/lists/sets, and doing most of your computation in tight Python for loops, then you'll want to prefer having many processes with few threads each. This incurs extra communication costs, but lets you bypass the GIL.

In short, if you're using mostly numpy/pandas-style data, try to get at least eight threads or so in a process. Otherwise, maybe go for only two threads in a process.
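In practice these numbers are passed to whatever launches the workers. A sketch using the dask-worker CLI for the two 36-core scenarios above (the scheduler address is a placeholder, and flag names vary across dask versions: `--nprocs` was renamed `--nworkers` in newer releases):

```shell
# numpy/pandas-heavy workload: few processes, many threads each
dask-worker tcp://scheduler-host:8786 --nprocs 4 --nthreads 9

# mostly pure-Python workload: many processes, few threads each
dask-worker tcp://scheduler-host:8786 --nprocs 12 --nthreads 3
```

The same split can be expressed from Python, e.g. `LocalCluster(n_workers=4, threads_per_worker=9)` for a single-machine cluster, or via the nthreads argument in dask-mpi.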

