400 threads in 20 processes outperform 400 threads in 4 processes while performing a CPU-bound task on 4 CPUs


Problem Description

This question is very similar to "400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task". The only difference is that the linked question is about an I/O-bound task whereas this question is about a CPU-bound task.

Here is the experimental code that launches a specified number of worker processes, launches a specified number of worker threads within each process, and performs the task of computing the n-th prime number.

import math
import multiprocessing
import random
import sys
import time
import threading

def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    tasks = int(sys.argv[3])

    # Start workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for nth in range(1, tasks + 1):
        in_q.put(nth)

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    total_time = time.time() - start_time
    task_speed = tasks / total_time

    print('{:3d} x {:3d} workers => {:6.3f} s, {:5.1f} tasks/s'
          .format(processes, threads, total_time, task_speed))



def process_worker(threads, in_q):
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    while True:
        nth = in_q.get()
        if nth is None:
            break
        num = find_nth_prime(nth)
        #print(num)


def find_nth_prime(nth):
    # Find n-th prime from scratch.
    if nth == 0:
        return

    count = 0
    num = 2
    while True:
        if is_prime(num):
            count += 1

        if count == nth:
            return num

        num += 1


def is_prime(num):
    for i in range(2, int(math.sqrt(num)) + 1):
        if num % i == 0:
            return False
    return True


if __name__ == '__main__':
    main()
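
As a quick sanity check of the task function (my addition, not part of the original post): the 1000th prime is 7919, so with the code above saved as foo.py an interactive session should behave like this:

>>> from foo import find_nth_prime
>>> find_nth_prime(1000)
7919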

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <TASKS>

For example, python3 foo.py 20 20 2000 creates 20 worker processes with 20 threads in each worker process (thus a total of 400 worker threads) and performs 2000 tasks. In the end, this program prints how much time it took to perform the tasks and how many tasks it did per second on average.

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes).

Here are the results:

$ python3 bar.py 20 20 2000
 20 x  20 workers => 12.702 s, 157.5 tasks/s

$ python3 bar.py 20 20 2000
 20 x  20 workers => 13.196 s, 151.6 tasks/s

$ python3 bar.py 20 20 2000
 20 x  20 workers => 12.224 s, 163.6 tasks/s

$ python3 bar.py 20 20 2000
 20 x  20 workers => 11.725 s, 170.6 tasks/s

$ python3 bar.py 20 20 2000
 20 x  20 workers => 10.813 s, 185.0 tasks/s

When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 15% to 25% CPU.

Now I thought that I only have 4 CPUs. Even if I launch 20 worker processes, at most 4 processes can run at any point in physical time. Further, due to the global interpreter lock (GIL), only one thread in each process (thus a total of at most 4 threads) can run at any point in physical time.

Therefore, I thought that if I reduce the number of processes to 4 and increase the number of threads per process to 100, so that the total number of threads still remains 400, the performance should not deteriorate.

Case 2: 4 Processes x 100 Threads

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 bar.py 4 100 2000
  4 x 100 workers => 19.840 s, 100.8 tasks/s

$ python3 bar.py 4 100 2000
  4 x 100 workers => 22.716 s,  88.0 tasks/s

$ python3 bar.py 4 100 2000
  4 x 100 workers => 20.278 s,  98.6 tasks/s

$ python3 bar.py 4 100 2000
  4 x 100 workers => 19.896 s, 100.5 tasks/s

$ python3 bar.py 4 100 2000
  4 x 100 workers => 19.876 s, 100.6 tasks/s

The CPU usage is between 50% and 66% for each python3 worker process.

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where we have all 400 threads in a single process. This is obviously due to the global interpreter lock (GIL).

$ python3 bar.py 1 400 2000
  1 x 400 workers => 34.762 s,  57.5 tasks/s

$ python3 bar.py 1 400 2000
  1 x 400 workers => 35.276 s,  56.7 tasks/s

$ python3 bar.py 1 400 2000
  1 x 400 workers => 32.589 s,  61.4 tasks/s

$ python3 bar.py 1 400 2000
  1 x 400 workers => 33.974 s,  58.9 tasks/s

$ python3 bar.py 1 400 2000
  1 x 400 workers => 35.429 s,  56.5 tasks/s

The CPU usage is between 110% and 115% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 bar.py 400 1 2000
400 x   1 workers =>  8.814 s, 226.9 tasks/s

$ python3 bar.py 400 1 2000
400 x   1 workers =>  8.631 s, 231.7 tasks/s

$ python3 bar.py 400 1 2000
400 x   1 workers => 10.453 s, 191.3 tasks/s

$ python3 bar.py 400 1 2000
400 x   1 workers =>  8.234 s, 242.9 tasks/s

$ python3 bar.py 400 1 2000
400 x   1 workers =>  8.324 s, 240.3 tasks/s

The CPU usage is between 1% and 3% for each python3 worker process.

Picking the median result from each case, we get this summary:

Case 1:  20 x  20 workers => 12.224 s, 163.6 tasks/s
Case 2:   4 x 100 workers => 19.896 s, 100.5 tasks/s
Case 3:   1 x 400 workers => 34.762 s,  57.5 tasks/s
Case 4: 400 x   1 workers =>  8.631 s, 231.7 tasks/s

Question

Why does 20 processes x 20 threads perform better than 4 processes x 100 threads even though I have only 4 CPUs?

In fact, 400 processes x 1 thread performs the best despite the presence of only 4 CPUs. Why?

Answer

Before a Python thread can execute code, it needs to acquire the Global Interpreter Lock (GIL). This is a per-process lock. In some cases (e.g. when waiting for I/O operations to complete) a thread will routinely release the GIL so other threads can acquire it. If the active thread does not give up the lock within a certain time, other threads can signal it to release the GIL so they can take turns.
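
In CPython, the "certain time" mentioned above is the interpreter's switch interval. As a small illustration (assuming CPython; the exact mechanism is an implementation detail), you can inspect and tune it:

import sys

# A CPU-bound thread is asked to drop the GIL roughly every
# switch-interval seconds so that waiting threads get a turn.
print(sys.getswitchinterval())   # 0.005 seconds by default

# The interval can be changed; a larger value means fewer forced
# GIL hand-offs (shown only for illustration, not a recommendation):
# sys.setswitchinterval(0.05)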

With that in mind, let's look at how your code performs on my 4-core laptop:

  1. In the simplest case (1 process with 1 thread) I get ~155 tasks/s. The GIL is not getting in our way here. We use 100% of one core.

  2. If I bump up the number of threads (1 process with 4 threads), I get ~70 tasks/s. This might be counter-intuitive at first but can be explained by the fact that your code is mostly CPU-bound, so all threads need the GIL pretty much all the time. Only one of them can run its computation at a time, so we don't benefit from multithreading. The result is that we use ~25% of each of my 4 cores. To make matters worse, acquiring and releasing the GIL as well as context switching add significant overhead that brings down overall performance.

  3. Adding more threads (1 process with 400 threads) doesn't help since only one of them gets executed at a time. On my laptop performance is pretty similar to case (2); again we use ~25% of each of my 4 cores.

  4. With 4 processes with 1 thread each, I get ~550 tasks/s, almost 4 times what I got in case (1). It is actually a little bit less than 4x due to the overhead required for inter-process communication and locking on the shared queue. Note that each process is using its own GIL. (A minimal Pool-based sketch of this one-process-per-core approach follows after this list.)

  5. With 4 processes running 100 threads each, I get ~290 tasks/s. Again we see the slow-down we saw in (2), this time affecting each separate process.

  6. With 400 processes running 1 thread each, I get ~530 tasks/s. Compared to (4) we see additional overhead due to inter-process communication and locking on the shared queue.
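
For comparison, here is a minimal sketch (my illustration, not code from the original post) of the one-process-per-core pattern from point (4), using multiprocessing.Pool instead of hand-rolled workers and a shared queue. It assumes the question's code is saved as foo.py so that find_nth_prime can be imported:

import multiprocessing

from foo import find_nth_prime  # the task function from the question's code

def run_tasks(tasks=2000):
    # One worker process per CPU core; each process has its own GIL,
    # so CPU-bound work can scale with the number of cores.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # A larger chunksize reduces per-task inter-process communication;
        # the value here is only a guess, not a tuned setting.
        pool.map(find_nth_prime, range(1, tasks + 1), chunksize=50)

if __name__ == '__main__':
    run_tasks()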

Please refer to David Beazley's talk "Understanding the Python GIL" for a more in-depth explanation of these effects.

Note: Some Python interpreters like CPython and PyPy have a GIL while others like Jython and IronPython don't. If you use another Python interpreter you might see very different behavior.
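
If you want to confirm which interpreter a given set of numbers came from, a tiny snippet like this (just a convenience check, nothing specific to the benchmark) prints the implementation name and version:

import platform
import sys

# CPython and PyPy have a GIL; Jython and IronPython do not.
print(platform.python_implementation(), sys.version)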
