Multiprocessing.Pool makes Numpy matrix multiplication slower


Question


So, I am playing around with multiprocessing.Pool and Numpy, but it seems I missed some important point. Why is the pool version much slower? I looked at htop and I can see several processes being created, but they all share one CPU, adding up to ~100%.

$ cat test_multi.py 
import numpy as np
from timeit import timeit
from multiprocessing import Pool


def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    pool = Pool(8)
    print timeit(lambda: map(mmul, matrices), number=20)
    print timeit(lambda: pool.map(mmul, matrices), number=20)

$ python test_multi.py 
16.0265390873
19.097837925

[update]

  • changed to timeit for benchmarking the processes
  • initialized the Pool with the number of my cores
  • changed the computation so that there is more computation and less memory transfer (I hope)

Still no change. The pool version is still slower, and I can see in htop that only one core is used even though several processes are spawned.

[update2]

At the moment I am reading about @Jan-Philip Gehrcke's suggestion to use multiprocessing.Process() and Queue. But in the meantime I would like to know:

  1. Why does my example work for tiago? What could be the reason it is not working on my machine1?
  2. Is there any copying between the processes in my example code? I intended my code to give each process one matrix from the matrices list.
  3. Is my code a bad example, because I use Numpy?

I learned that one often gets better answers when others know my end goal, so: I have a lot of files, which are currently loaded and processed in a serial fashion. The processing is CPU-intensive, so I assume much could be gained by parallelization. My aim is to call the Python function that analyses a file in parallel. Furthermore, this function is just an interface to C code; I assume that makes a difference.

1 Ubuntu 12.04, Python 2.7.3, i7 860 @ 2.80 - Please leave a comment if you need more info.

[update3]

Here are the results from Stefano's example code. For some reason there is no speed up. :/

testing with 16 matrices
base  4.27
   1  5.07
   2  4.76
   4  4.71
   8  4.78
  16  4.79
testing with 32 matrices
base  8.82
   1 10.39
   2 10.58
   4 10.73
   8  9.46
  16  9.54
testing with 64 matrices
base 17.38
   1 19.34
   2 19.62
   4 19.59
   8 19.39
  16 19.34

[update 4] answer to Jan-Philip Gehrcke's comment

Sorry that I haven't made myself clearer. As I wrote in update 2, my main goal is to parallelize many serial calls of a third-party Python library function. This function is an interface to some C code. I was recommended to use Pool, but this didn't work, so I tried something simpler: the example with numpy shown above. But there, too, I could not achieve a performance improvement, even though it looks 'embarrassingly parallelizable' to me. So I assume I must have missed something important. This information is what I am looking for with this question and bounty.

[update 5]

Thanks for all your tremendous input. But reading through your answers only creates more questions for me. For that reason I will read up on the basics and create new SO questions once I have a clearer understanding of what I don't know.

Solution

Regarding the fact that all of your processes are running on the same CPU, see my answer here.

During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool, all of the worker processes it spawns will end up vying for the same core, rather than using all of the cores available on your machine.

You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:

import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool


def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':

    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    print timeit(lambda: map(mmul, matrices), number=20)

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all cores
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)
    print timeit(lambda: pool.map(mmul, matrices), number=20)

Output:

    $ python tmp.py                                     
    12.4765810966
    pid 29150's current affinity mask: 1
    pid 29150's new affinity mask: ff
    13.4136221409

If you watch CPU usage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation, etc. probably outweighs any possible benefit from parallelisation.
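That pickling cost is easy to estimate: each 1000x1000 matrix of 8-byte elements serializes to roughly 8 MB, which must cross a pipe to the worker and back again for every task. A quick sketch (a rough size check, not a full benchmark):

```python
import pickle

import numpy as np

# Same shape as the matrices in the question; float64 elements are
# 8 bytes each, just like the int64 values random_integers produces.
matrix = np.random.rand(1000, 1000)

payload = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)

# 1000 * 1000 * 8 bytes of raw data, plus a small pickle header.
print(len(payload))
```

With four such matrices and number=20 repetitions, hundreds of megabytes move through the pool's pipes, which can easily swamp the compute time of an element-wise multiply.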

Edit: I suspect that part of the reason why the single process seems to be consistently faster is that numpy may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.

For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:

import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool

def fib(dummy):
    n = [1,1]
    for ii in xrange(100000):
        n.append(n[-1]+n[-2])

def silly_mult(matrix):
    for row in matrix:
        for val in row:
            val * val

if __name__ == '__main__':

    dt = timeit(lambda: map(fib, xrange(10)), number=10)
    print "Fibonacci, non-parallel: %.3f" %dt

    matrices = [np.random.randn(1000,1000) for ii in xrange(10)]
    dt = timeit(lambda: map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, non-parallel: %.3f" %dt

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all CPUS
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)

    dt = timeit(lambda: pool.map(fib,xrange(10)), number=10)
    print "Fibonacci, parallel: %.3f" %dt

    dt = timeit(lambda: pool.map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, parallel: %.3f" %dt

Output:

$ python tmp.py
Fibonacci, non-parallel: 32.449
Silly matrix multiplication, non-parallel: 40.084
pid 29528's current affinity mask: 1
pid 29528's new affinity mask: ff
Fibonacci, parallel: 9.462
Silly matrix multiplication, parallel: 12.163
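As a side note beyond the original answer (which targets Python 2 and shells out to taskset): on Linux with Python 3.3+, the same affinity reset can be done from Python itself via os.sched_setaffinity. A minimal sketch, assuming a Linux host where these calls are available:

```python
import os

# CPUs this process is currently allowed to run on.
allowed = os.sched_getaffinity(0)  # 0 means "the calling process"

# Try to widen the mask to every CPU the OS reports, undoing any
# restrictive mask inherited at import time. In a restricted
# environment (e.g. a cgroup cpuset) this can be refused, so fall
# back to the mask we already had.
try:
    os.sched_setaffinity(0, range(os.cpu_count()))
except OSError:
    os.sched_setaffinity(0, allowed)

print(sorted(os.sched_getaffinity(0)))
```

This must run before Pool is created, for the same reason as the taskset call above: workers inherit the parent's affinity mask at fork time.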
