400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task

Problem Description

Here is the experimental code that can launch a specified number of worker processes and then launch a specified number of worker threads within each process and perform the task of fetching URLs:

import multiprocessing
import sys
import time
import threading
import urllib.request


def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    urls = int(sys.argv[3])

    # Start process workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for n in range(urls):
        in_q.put('http://www.example.com/?n={}'.format(n))

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    # Print time consumed and fetch speed.
    total_time = time.time() - start_time
    fetch_speed = urls / total_time
    print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
          .format(processes, threads, total_time, fetch_speed))



def process_worker(threads, in_q):
    # Start thread workers.
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    # Wait for thread workers to terminate.
    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    # Each thread performs the actual work. In this case, we will assume
    # that the work is to fetch a given URL.
    while True:
        url = in_q.get()
        if url is None:
            break

        with urllib.request.urlopen(url) as u:
            pass # Do nothing
            # print('{} - {} {}'.format(url, u.getcode(), u.reason))


if __name__ == '__main__':
    main()

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <URLS>

For example, python3 foo.py 20 20 10000 creates 20 worker processes with 20 threads in each worker process (thus a total of 400 worker threads) and fetches 10000 URLs. In the end, this program prints how much time it took to fetch the URLs and how many URLs it fetched per second on average.

Note that in all cases I am really hitting a URL of the www.example.com domain, i.e., www.example.com is not merely a placeholder. In other words, I run the above code unmodified.

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.

Here are the results:

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s

We can see that about 1900 URLs are fetched per second on average. When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 10% to 15% CPU.

Now I thought that I only have 4 CPUs. Even if I launch 20 worker processes, at most only 4 processes can run at any point in physical time. Further, due to the global interpreter lock (GIL), only one thread in each process (thus a total of 4 threads at most) can run at any point in physical time.
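
To see the GIL constraint in isolation, here is a minimal sketch (my illustration, not part of the original benchmark) in which two CPU-bound threads in one process take about as long as running the same work twice sequentially, because only one of them can execute Python bytecode at a time:

import threading
import time


def burn(n=10 ** 7):
    # A pure-Python busy loop; it holds the GIL except at the
    # interpreter's periodic thread-switch points.
    while n:
        n -= 1


start = time.time()
burn()
burn()
print('sequential:  {:.2f} s'.format(time.time() - start))

start = time.time()
workers = [threading.Thread(target=burn) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print('two threads: {:.2f} s'.format(time.time() - start))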

Therefore, I thought that if I reduce the number of processes to 4 and increase the number of threads per process to 100, so that the total number of threads still remains 400, the performance should not deteriorate.

Case 2: 4 Processes x 100 Threads

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s

The CPU usage is between 40% and 60% for each python3 worker process.

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where we have all 400 threads in a single process. This is most certainly due to the global interpreter lock (GIL).

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s

The CPU usage is between 120% and 125% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s

The CPU usage is between 1% and 3% for each python3 worker process.

Picking the median result from each case, we get this summary:

Case 1:  20 x  20 workers => 5.22 s, 1914.2 URLs/s ( 10% to  15% CPU/process)
Case 2:   4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to  60% CPU/process)
Case 3:   1 x 400 workers => 13.5 s,  742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x   1 workers => 7.23 s, 1382.9 URLs/s (  1% to   3% CPU/process)

Question

Why does 20 processes x 20 threads perform better than 4 processes x 100 threads even if I have only 4 CPUs?

Answer

Your task is I/O-bound rather than CPU-bound: threads spend most of their time sleeping, waiting for network data and such, rather than using the CPU.

So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at a time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).
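
As a rough illustration of this point (a sketch of my own, not code from the answer, with time.sleep() standing in for a blocking network wait): sleep releases the GIL just like a blocking socket read does, so 100 threads that each wait one second finish in about one second of wall time, far more concurrency than 4 CPUs could provide for CPU-bound work:

import threading
import time


def fake_io():
    # Stands in for a blocking network call; the GIL is released
    # for the entire duration of the sleep.
    time.sleep(1)


start = time.time()
workers = [threading.Thread(target=fake_io) for _ in range(100)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print('100 overlapped 1 s waits took {:.2f} s'.format(time.time() - start))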

As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but also for each other.

When dealing with I/O, the Python machinery:

  1. converts all Python objects involved into C objects (in many cases, this can be done without physically copying the data)
  2. releases the GIL
  3. performs the I/O in C (which involves waiting for an arbitrary amount of time)
  4. reacquires the GIL
  5. converts the result to a Python object, if applicable

If there are enough threads in the same process, it becomes increasingly likely that another one is active when step 4 is reached, causing an additional random delay.
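
The cost of step 4 can be made visible with a small sketch (an illustration of standard CPython behavior, not code from the answer): a thread that finishes its "I/O" (a sleep here) cannot proceed until it wins the GIL back from a CPU-bound sibling, so its wake-up is measurably late:

import threading
import time


def hog(stop_event):
    # A pure-Python loop that keeps the GIL occupied between the
    # interpreter's switch intervals.
    while not stop_event.is_set():
        pass


stop_event = threading.Event()
threading.Thread(target=hog, args=(stop_event,)).start()

start = time.time()
time.sleep(0.1)  # steps 2-3: the GIL is released while "waiting for I/O"
# Step 4: the main thread had to reacquire the GIL before this line could
# run, so the measured delay exceeds the requested 0.1 s.
print('asked for 0.100 s, got {:.3f} s'.format(time.time() - start))
stop_event.set()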

Now, when it comes to lots of processes, other factors come into play, like memory swapping (since, unlike threads, processes running the same code don't share memory). I'm pretty sure there are other delays from lots of processes, as opposed to threads, competing for resources, but I can't point to them off the top of my head. That's why the performance becomes unstable.
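
On the memory point, here is a quick sketch (hypothetical, not from the answer) of why processes are heavier than threads: each worker process gets its own copy of the program's data, so a change made in a child is invisible to the parent, whereas threads in one process would all see it:

import multiprocessing

counter = 0


def bump():
    # Runs in a child process that has its own copy of `counter`.
    global counter
    counter += 1
    print('child sees counter =', counter)   # prints 1 in each child


if __name__ == '__main__':
    workers = [multiprocessing.Process(target=bump) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print('parent sees counter =', counter)  # still 0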
