400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task

Published: 2022/1/12 12:38:00. Tags: python multithreading performance multiprocessing gil


Problem Description

Here is the experimental code. It launches a specified number of worker processes, launches a specified number of worker threads within each process, and has each thread fetch URLs:

import multiprocessing
import sys
import time
import threading
import urllib.request


def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    urls = int(sys.argv[3])

    # Start process workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for n in range(urls):
        in_q.put('http://www.example.com/?n={}'.format(n))

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    # Print time consumed and fetch speed.
    total_time = time.time() - start_time
    fetch_speed = urls / total_time
    print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
          .format(processes, threads, total_time, fetch_speed))



def process_worker(threads, in_q):
    # Start thread workers.
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    # Wait for thread workers to terminate.
    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    # Each thread performs the actual work. In this case, we will assume
    # that the work is to fetch a given URL.
    while True:
        url = in_q.get()
        if url is None:
            break

        with urllib.request.urlopen(url) as u:
            pass # Do nothing
            # print('{} - {} {}'.format(url, u.getcode(), u.reason))


if __name__ == '__main__':
    main()

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <URLS>

For example, python3 foo.py 20 20 10000 creates 20 worker processes with 20 threads in each worker process (a total of 400 worker threads) and fetches 10000 URLs. At the end, the program prints how long it took to fetch the URLs and how many URLs it fetched per second on average.
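The arithmetic behind that summary line can be sketched as follows. This is a minimal illustration, not part of the test program: summarize is a hypothetical helper, and the elapsed time of 5.12 s is an assumed input chosen for the example. It reuses the same format string as the program, where {:.3} rounds the time to three significant digits.

```python
def summarize(processes, threads, urls, total_time):
    # Total worker threads across all processes.
    total_workers = processes * threads
    # Average fetch speed over the whole run.
    fetch_speed = urls / total_time
    # Same format string as the experimental program.
    line = '{} x {} workers => {:.3} s, {:.1f} URLs/s'.format(
        processes, threads, total_time, fetch_speed)
    return total_workers, line


# 5.12 is an assumed elapsed time, used only for illustration.
workers, line = summarize(20, 20, 10000, 5.12)
print(workers)  # 400
print(line)     # 20 x 20 workers => 5.12 s, 1953.1 URLs/s
```

Because the printed time is rounded, dividing the URL count by the printed time will not always reproduce the printed speed exactly.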

Note that in all cases I am really hitting URLs on the www.example.com domain, i.e., www.example.com is not merely a placeholder. In other words, I run the above code unmodified.

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.

Here are the results:

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s

We can see that about 1900 URLs are fetched per second on average. When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 10% to 15% CPU.
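As a quick check of the "about 1900" figure, the mean of the five Case 1 speeds can be computed directly. This is a small sketch using the numbers reported above; case1_speeds is just an illustrative variable name.

```python
from statistics import mean

# Fetch speeds (URLs/s) from the five 20 x 20 trial runs above.
case1_speeds = [1954.6, 1895.5, 1914.2, 1859.8, 1925.2]

average_speed = mean(case1_speeds)
print('{:.1f} URLs/s'.format(average_speed))  # 1909.9 URLs/s
```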

Case 2: 4 Processes x 100 Threads

Now, I thought that I have only 4 CPUs. Even if I launch 20 worker processes, at most 4 processes can run at any point in physical time. Further, due to the global interpreter lock (GIL), only one thread in each process (thus a total of 4 threads at most) can run at any point in physical time.

Therefore, I thought that if I reduced the number of processes to 4 and increased the number of threads per process to 100, so that the total number of threads still remains 400, the performance should not deteriorate.

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s

The CPU usage is between 40% and 60% for each python3 worker process.
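One caveat about the reasoning in this case: in CPython, a thread gives up the GIL while it is blocked in I/O or in time.sleep, so the one-thread-at-a-time limit applies to threads executing Python bytecode, not to threads that are waiting. The following is a minimal sketch of that behavior only, assuming time.sleep as a stand-in for a blocking network call; blocking_task and run_threads are hypothetical names, and the sketch does not by itself explain the measurements above.

```python
import threading
import time


def blocking_task(delay):
    # time.sleep stands in for a blocking network call (an assumption);
    # the GIL is released while the thread waits.
    time.sleep(delay)


def run_threads(count, delay):
    # Start `count` threads that each wait `delay` seconds, then
    # return the wall-clock time until all of them have finished.
    workers = [threading.Thread(target=blocking_task, args=(delay,))
               for _ in range(count)]
    start = time.monotonic()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.monotonic() - start


elapsed = run_threads(20, 0.2)
print('20 threads x 0.2 s wait => {:.2f} s elapsed'.format(elapsed))
```

If the waits were serialized, 20 threads would take about 4 seconds; in a single process they finish in roughly the time of one wait because they overlap.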

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where we have all 400 threads in a single process. This is most certainly due to the global interpreter lock (GIL).

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s

The CPU usage is between 120% and 125% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s

The CPU usage is between 1% and 3% for each python3 worker process.

Picking the median result from each case, we get this summary:

Case 1:  20 x  20 workers => 5.22 s, 1914.2 URLs/s ( 10% to  15% CPU/process)
Case 2:   4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to  60% CPU/process)
Case 3:   1 x 400 workers => 13.5 s,  742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x   1 workers => 7.23 s, 1382.9 URLs/s (  1% to   3% CPU/process)
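The median selection can be reproduced from the elapsed times reported in the trial runs. This is a small sketch; trial_times and medians are illustrative names, and the values are copied from the runs above.

```python
from statistics import median

# Elapsed times (seconds) from the five trial runs of each case.
trial_times = {
    '20 x 20': [5.12, 5.28, 5.22, 5.38, 5.19],
    '4 x 100': [9.2, 10.9, 7.8, 10.3, 6.37],
    '1 x 400': [13.5, 14.3, 13.1, 15.6, 13.1],
    '400 x 1': [14.0, 6.1, 7.08, 7.23, 11.3],
}

# With five trials per case, the median is the third-fastest run.
medians = {case: median(times) for case, times in trial_times.items()}
for case, value in medians.items():
    print('{} workers => median {} s'.format(case, value))
```

The medians come out as 5.22 s, 9.2 s, 13.5 s, and 7.23 s, matching the rows of the summary.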

Question

Why do 20 processes x 20 threads perform better than 4 processes x 100 threads even though I have only 4 CPUs?
