400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task

Problem Description

Here is the experimental code that can launch a specified number of worker processes and then launch a specified number of worker threads within each process and perform the task of fetching URLs:

import multiprocessing
import sys
import time
import threading
import urllib.request


def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    urls = int(sys.argv[3])

    # Start process workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for n in range(urls):
        in_q.put('http://www.example.com/?n={}'.format(n))

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    # Print time consumed and fetch speed.
    total_time = time.time() - start_time
    fetch_speed = urls / total_time
    print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
          .format(processes, threads, total_time, fetch_speed))



def process_worker(threads, in_q):
    # Start thread workers.
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    # Wait for thread workers to terminate.
    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    # Each thread performs the actual work. In this case, we will assume
    # that the work is to fetch a given URL.
    while True:
        url = in_q.get()
        if url is None:
            break

        with urllib.request.urlopen(url) as u:
            pass # Do nothing
            # print('{} - {} {}'.format(url, u.getcode(), u.reason))


if __name__ == '__main__':
    main()

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <URLS>

For example, python3 foo.py 20 20 10000 creates 20 worker processes with 20 threads in each worker process (thus a total of 400 worker threads) and fetches 10000 URLs. In the end, this program prints how much time it took to fetch the URLs and how many URLs it fetched per second on average.

Note that in all cases I am really hitting a URL of the www.example.com domain, i.e., www.example.com is not merely a placeholder. In other words, I run the above code unmodified.

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.

Here are the results:

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s

We can see that about 1900 URLs are fetched per second on average. When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 10% to 15% CPU.

Now I thought that I only have 4 CPUs. Even if I launch 20 worker processes, at most only 4 processes can run at any point in physical time. Further, due to the global interpreter lock (GIL), only one thread in each process (thus a total of 4 threads at most) can run at any point in physical time.
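
To see the GIL constraint in isolation, here is a minimal sketch (my illustration, not part of the original benchmark) in which two CPU-bound threads in one process take about as long as running the same work twice sequentially, because only one of them can execute Python bytecode at a time:

import threading
import time


def burn(n=10 ** 7):
    # A pure-Python busy loop; it holds the GIL except at the
    # interpreter's periodic thread-switch points.
    while n:
        n -= 1


start = time.time()
burn()
burn()
print('sequential:  {:.2f} s'.format(time.time() - start))

start = time.time()
workers = [threading.Thread(target=burn) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print('two threads: {:.2f} s'.format(time.time() - start))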

Therefore, I thought that if I reduce the number of processes to 4 and increase the number of threads per process to 100, so that the total number of threads still remains 400, the performance should not deteriorate.

Case 2: 4 Processes x 100 Threads

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s

The CPU usage is between 40% and 60% for each python3 worker process.

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where we have all 400 threads in a single process. This is most certainly due to the global interpreter lock (GIL).

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s

The CPU usage is between 120% and 125% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s

The CPU usage is between 1% and 3% for each python3 worker process.

Picking the median result from each case, we get this summary:

Case 1:  20 x  20 workers => 5.22 s, 1914.2 URLs/s ( 10% to  15% CPU/process)
Case 2:   4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to  60% CPU/process)
Case 3:   1 x 400 workers => 13.5 s,  742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x   1 workers => 7.23 s, 1382.9 URLs/s (  1% to   3% CPU/process)

Question

Why does 20 processes x 20 threads perform better than 4 processes x 100 threads even if I have only 4 CPUs?

Answer

Your task is I/O-bound rather than CPU-bound: threads spend most of their time sleeping, waiting for network data and such, rather than using the CPU.

So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at a time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).
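
As a rough illustration of this point (a sketch of my own, not code from the answer, with time.sleep() standing in for a blocking network wait): sleep releases the GIL just like a blocking socket read does, so 100 threads that each wait one second finish in about one second of wall time, far more concurrency than 4 CPUs could provide for CPU-bound work:

import threading
import time


def fake_io():
    # Stands in for a blocking network call; the GIL is released
    # for the entire duration of the sleep.
    time.sleep(1)


start = time.time()
workers = [threading.Thread(target=fake_io) for _ in range(100)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print('100 overlapped 1 s waits took {:.2f} s'.format(time.time() - start))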

As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but also for each other.

When dealing with I/O, the Python machinery:

  1. converts all Python objects involved into C objects (in many cases, this can be done without physically copying the data)
  2. releases the GIL
  3. performs the I/O in C (which involves waiting for an arbitrary amount of time)
  4. reacquires the GIL
  5. converts the result to a Python object, if applicable

If there are enough threads in the same process, it becomes increasingly likely that another one is active when step 4 is reached, causing an additional random delay.
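
The cost of step 4 can be made visible with a small sketch (an illustration of standard CPython behavior, not code from the answer): a thread that finishes its "I/O" (a sleep here) cannot proceed until it wins the GIL back from a CPU-bound sibling, so its wake-up is measurably late:

import threading
import time


def hog(stop_event):
    # A pure-Python loop that keeps the GIL occupied between the
    # interpreter's switch intervals.
    while not stop_event.is_set():
        pass


stop_event = threading.Event()
threading.Thread(target=hog, args=(stop_event,)).start()

start = time.time()
time.sleep(0.1)  # steps 2-3: the GIL is released while "waiting for I/O"
# Step 4: the main thread had to reacquire the GIL before this line could
# run, so the measured delay exceeds the requested 0.1 s.
print('asked for 0.100 s, got {:.3f} s'.format(time.time() - start))
stop_event.set()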

Now, when it comes to lots of processes, other factors come into play, like memory swapping (since, unlike threads, processes running the same code don't share memory). I'm pretty sure there are other delays from lots of processes, as opposed to threads, competing for resources, but I can't point to them off the top of my head. That's why the performance becomes unstable.
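
On the memory point, here is a quick sketch (hypothetical, not from the answer) of why processes are heavier than threads: each worker process gets its own copy of the program's data, so a change made in a child is invisible to the parent, whereas threads in one process would all see it:

import multiprocessing

counter = 0


def bump():
    # Runs in a child process that has its own copy of `counter`.
    global counter
    counter += 1
    print('child sees counter =', counter)   # prints 1 in each child


if __name__ == '__main__':
    workers = [multiprocessing.Process(target=bump) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print('parent sees counter =', counter)  # still 0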
