Why does this Python script run 4x slower on multiple cores than on a single core


Problem description


I'm trying to understand how CPython's GIL works and what the differences are between the GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:

from __future__ import print_function

import argparse
import resource
import sys
import threading
import time


def countdown(n):
    while n > 0:
        n -= 1


def get_time():
    stats = resource.getrusage(resource.RUSAGE_SELF)
    total_cpu_time = stats.ru_utime + stats.ru_stime
    return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


def get_time_diff(start_time, end_time):
    return tuple((end-start) for start, end in zip(start_time, end_time))


def main(total_cycles, max_threads, no_headers=False):
    header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
              ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
               "par_r", "par_c", "par_u", "par_s"))
    row_format = ("%(threads)4d "
                  "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                  "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
    if not no_headers:
        print(header)
    for thread_count in range(1, max_threads+1):
        # We don't care about a few lost cycles
        cycles = total_cycles // thread_count

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]

        # Sequential baseline: start each thread and immediately join it,
        # so only one thread ever runs at a time.
        start_time = get_time()
        for thread in threads:
            thread.start()
            thread.join()
        end_time = get_time()
        sequential = get_time_diff(start_time, end_time)

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]
        # Parallel case: start all threads first, then join them, so the
        # countdowns run concurrently and compete for the GIL.
        start_time = get_time()
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        end_time = get_time()
        parallel = get_time_diff(start_time, end_time)

        print(row_format % {"threads": thread_count,
                            "seq_r": sequential[0],
                            "seq_c": sequential[1],
                            "seq_u": sequential[2],
                            "seq_s": sequential[3],
                            "par_r": parallel[0],
                            "par_c": parallel[1],
                            "par_u": parallel[2],
                            "par_s": parallel[3]})


if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("max_threads", nargs="?",
                            type=int, default=5)
    arg_parser.add_argument("total_cycles", nargs="?",
                            type=int, default=50000000)
    arg_parser.add_argument("--no-headers",
                            action="store_true")
    args = arg_parser.parse_args()
    sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):

  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
   3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
   4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
   5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:

taskset -c 0 python countdown.py 
  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
   3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
   4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
   5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15

So the question is: why is running this Python code on multiple cores 1.5x-2x slower by wall clock and 4x-5x slower by CPU time than running it on a single core?

Asking around and googling produced two hypotheses:

  1. When running on multiple cores, a thread can be re-scheduled to run on a different core, which means that its local cache is invalidated, hence the slowdown.
  2. The overhead of suspending a thread on one core and resuming it on another is larger than suspending and resuming the thread on the same core.

Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).
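
For what it's worth, part of those numbers can be had without an external profiler: on Linux, resource.getrusage also reports context-switch counts, which bears directly on hypothesis 2 (cache-miss counts would still need something like perf stat). A minimal sketch; the count_switches helper here is illustrative, not part of the benchmark above:

import resource
import threading


def count_switches(target, args, thread_count):
    # Snapshot the process-wide context-switch counters, run the threads
    # concurrently, then report the deltas.
    before = resource.getrusage(resource.RUSAGE_SELF)
    threads = [threading.Thread(target=target, args=args)
               for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    after = resource.getrusage(resource.RUSAGE_SELF)
    # ru_nvcsw: voluntary switches (a thread gave up the CPU, e.g. while
    # waiting for the GIL); ru_nivcsw: involuntary switches (preempted).
    return (after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)


print(count_switches(countdown, (10000000,), 4))

Running this with and without taskset -c 0 should show whether the multi-core case really does dramatically more switching.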

Solution

It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell you everything you want to know.

See info here for a nice graphical representation of what is happening.

Python 3.2 introduced changes to the GIL that help address this problem, so you should see improved performance with 3.2 and later.
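
For reference, the tunable that changed is visible from sys: CPython 2.x releases the GIL every 100 bytecode "checks" (sys.getcheckinterval()), while 3.2+ uses a time-based switch interval. A quick way to inspect and adjust it, assuming CPython 3.2 or later:

import sys

# In the new GIL (CPython 3.2+), the holder is asked to release the lock
# after a fixed time slice rather than after N bytecode instructions.
print(sys.getswitchinterval())  # 0.005 seconds by default

# A longer interval means fewer GIL hand-offs (and less cross-core
# signaling), at the cost of responsiveness for other threads.
sys.setswitchinterval(0.05)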

It should also be noted that the GIL is an implementation detail of the CPython reference implementation of the language. Other implementations like Jython do not have a GIL and do not suffer from this particular problem.

The rest of D. Beazley's info on the GIL will also be helpful to you.

To specifically answer your question about why performance is so much worse when multiple cores are involved, see slides 29-41 of the Inside the GIL presentation. They discuss multicore GIL contention in detail, as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread-signaling overhead goes through the roof as you add cores: the threads are now running simultaneously on different cores, which allows them to engage in a true GIL battle, rather than sharing a single CPU. A good summary bullet from the presentation:

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.
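
If you want to reproduce the taskset -c 0 behaviour from inside the interpreter instead of the shell, CPython 3.3+ on Linux exposes the scheduler-affinity API; a sketch under that assumption:

import os

# Pin the whole process (pid 0 means the calling process) to core 0,
# equivalent to launching the script under `taskset -c 0`.
os.sched_setaffinity(0, {0})

With every thread confined to one core, CPU-bound threads can no longer be scheduled simultaneously, and the cross-core GIL battle (and the system-call spike from slide 32) goes away.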
