Why does multi-processing slow down a nested for loop?

Question

I have a lot of very large matrices AFeatures that I am comparing against some other very large matrices BFeatures, both of which have a shape of (878, 2, 4, 15, 17, 512), using the Euclidean distance. I am trying to parallelise this process to speed up the comparison. I am using Python 3 in a Conda environment and my original code uses an average of two CPU cores at 100%:

    import numpy as np

    # Pairwise Euclidean distance between every (k, l) slice of A and B.
    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    per_slice_comparisons[i, j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

I have tried two approaches for speeding up the code.

  1. Using multi-processing

    from multiprocessing import Pool

    def fill_array(i):
        # Compare slice i of AFeatures against every slice of BFeatures.
        comparisons = np.zeros(shape=(878, 2, 4))
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    comparisons[j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])
        return comparisons

    pool = Pool(processes=6)
    list_start_vals = range(878)
    per_slice_comparisons = np.array(pool.map(fill_array, list_start_vals))
    pool.close()

This approach increases run time by around 5%, even though all 8 CPU cores are now being used at 100%. I have tried a number of different process counts: the more processes there are, the slower it gets.

  2. This is a slightly different approach, where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

    # The NUMEXPR_* variables must be set before numexpr is imported.
    import os
    os.environ['NUMEXPR_MAX_THREADS'] = '8'
    os.environ['NUMEXPR_NUM_THREADS'] = '4'
    import numexpr as ne

    def linalg_norm(a):
        sq_norm = ne.evaluate('sum(a**2)')
        return ne.evaluate('sqrt(sq_norm)')

    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    per_slice_comparisons[i, j, k, l] = linalg_norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

However, inside the nested for loop this approach increases total execution time by a factor of 3. I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically. If anyone has any ideas on how to fix this, I would really appreciate it!

Answer

Why does multi-processing slow down a nested for loop in Python?

Creating a process is a very expensive system operation. The operating system has to remap a lot of pages (program, shared libraries, data, etc.) so that the newly created process can access those of the initial process. The multiprocessing package also uses inter-process communication to share the work between processes, which is slow as well, not to mention the final join operation that is required. To be efficient (i.e. to reduce the overhead as much as possible), a Python program using the multiprocessing package should share a small amount of data and perform expensive computations. In your case, you do not need the multiprocessing package, since you only use Numpy arrays (see below).
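
As a rough, hypothetical illustration (not from the original answer), the communication cost alone can be measured by shipping arrays to a pool of workers that do no computation at all:

    # Hedged sketch: times only the pickling/IPC round trip, no math at all.
    import time
    import numpy as np
    from multiprocessing import Pool

    def identity(x):
        return x

    if __name__ == '__main__':
        a = np.zeros((2, 4, 15, 17, 512), dtype=np.float32)  # one AFeatures slice
        with Pool(processes=6) as pool:
            t0 = time.perf_counter()
            pool.map(identity, [a] * 100)  # 100 round trips, zero computation
            print(f'IPC round-trip cost: {time.perf_counter() - t0:.3f} s')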

This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

Numexpr uses threads rather than processes, and threads are lightweight compared to processes (i.e. less expensive). Numexpr also applies aggressive optimizations to speed up the evaluated expressions as much as possible (something CPython does not do).
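
One plausible contributor to the slowdown (my reading, not stated explicitly in the answer): every ne.evaluate() call carries a fixed dispatch cost (expression parsing or cache lookup, plus synchronizing the worker threads), and the nested loop pays that cost roughly 878 * 878 * 2 * 4 times on small slices. A toy sketch of how to observe this per-call overhead:

    import time
    import numpy as np
    import numexpr as ne

    a = np.random.rand(878, 512).astype(np.float32)

    t0 = time.perf_counter()
    ne.evaluate('sum(a**2)')            # one call over the whole array
    one_call = time.perf_counter() - t0

    t0 = time.perf_counter()
    for row in a:                       # same arithmetic, 878 separate calls
        ne.evaluate('sum(row**2)')
    many_calls = time.perf_counter() - t0

    print(f'one call: {one_call:.4f} s, many calls: {many_calls:.4f} s')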

I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically.

The default implementation of Python is CPython, which is an interpreter. Interpreters are generally very slow (especially CPython), and CPython performs almost no optimization of your code. If you want fast loops, you need an alternative that compiles them to native code or JIT-compiles them. You can use Cython or Numba for that, and both provide simple ways to parallelize your program. Using Numba is probably the simplest solution in your case; you can start by looking at the example programs.
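
A minimal sketch of what a Numba version could look like (my own illustration, not code from the answer; the Euclidean norm is written out by hand so it does not rely on np.linalg.norm support for 3-D slices inside Numba's nopython mode):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def all_comparisons(AFeatures, BFeatures):
        # prange distributes the outer loop across threads.
        n = AFeatures.shape[0]
        out = np.zeros((n, n, 2, 4))
        for i in prange(n):
            for j in range(n):
                for k in range(2):
                    for l in range(4):
                        diff = AFeatures[i, k, l] - BFeatures[j, k, l]
                        out[i, j, k, l] = np.sqrt(np.sum(diff ** 2))
        return out

    per_slice_comparisons = all_comparisons(AFeatures, BFeatures)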

Update: if the implementation of Numpy is multithreaded, then the multiprocessing code can be much slower. Indeed, each process will create N threads on a machine with N cores, so N*N threads will be run. This situation is called over-subscription and is known to be inefficient (due to preemptive multitasking, and especially context switches). One way to check this hypothesis is to look at how many threads are created (e.g. using the hwloc tool on POSIX systems) or simply to monitor the processor usage.
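
If over-subscription does turn out to be the problem, one common mitigation (my assumption, not something the answer prescribes) is to cap the BLAS thread pools via environment variables, which must be set before Numpy is imported in each worker:

    # These must be set before numpy is imported in each worker process.
    import os
    os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP-backed BLAS
    os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS
    os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL
    import numpy as np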
