Multiprocessing.Pool makes Numpy matrix multiplication slower


Question


So, I am playing around with multiprocessing.Pool and Numpy, but it seems I missed some important point. Why is the pool version much slower? I looked at htop and I can see several processes be created, but they all share one of the CPUs adding up to ~100%.

$ cat test_multi.py 
import numpy as np
from timeit import timeit
from multiprocessing import Pool


def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    pool = Pool(8)
    print timeit(lambda: map(mmul, matrices), number=20)
    print timeit(lambda: pool.map(mmul, matrices), number=20)

$ python test_multi.py 
16.0265390873
19.097837925

[update]

  • changed to timeit for benchmarking processes
  • init Pool with a number of my cores
  • changed computation so that there is more computation and less memory transfer (I hope)


Still no change. The pool version is still slower, and I can see in htop that only one core is used even though several processes are spawned.

[update2]


At the moment I am reading about @Jan-Philip Gehrcke's suggestion to use multiprocessing.Process() and Queue. But in the meantime I would like to know:

  1. Why does my example work for tiago? What could be the reason it is not working on my machine1?
  2. Is there any copying between the processes in my example code? I intended my code to give each process one matrix of the matrices list.
  3. Is my code a bad example because I use Numpy?


I learned that one often gets better answers when others know the end goal, so: I have a lot of files which are at the moment loaded and processed in a serial fashion. The processing is CPU-intensive, so I assume much could be gained by parallelization. My aim is to call, in parallel, the Python function that analyses a file. Furthermore, this function is just an interface to C code; I assume that makes a difference.


1 Ubuntu 12.04, Python 2.7.3, i7 860 @ 2.80 GHz - Please leave a comment if you need more info.

[update3]


Here are the results from Stefano's example code. For some reason there is no speed up. :/

testing with 16 matrices
base  4.27
   1  5.07
   2  4.76
   4  4.71
   8  4.78
  16  4.79
testing with 32 matrices
base  8.82
   1 10.39
   2 10.58
   4 10.73
   8  9.46
  16  9.54
testing with 64 matrices
base 17.38
   1 19.34
   2 19.62
   4 19.59
   8 19.39
  16 19.34


[update 4] answer to Jan-Philip Gehrcke's comment


Sorry that I haven't made myself clearer. As I wrote in update 2, my main goal is to parallelize many serial calls of a 3rd-party Python library function. This function is an interface to some C code. I was recommended to use Pool, but this didn't work, so I tried something simpler, the example with numpy shown above. But there, too, I could not achieve a performance improvement, even though the problem looks embarrassingly parallelizable to me. So I assume I must have missed something important. This information is what I am looking for with this question and the bounty.

[update 5]


Thanks for all your tremendous input. But reading through your answers only creates more questions for me. For that reason I will read about the basics and create new SO questions when I have a clearer understanding of what I don't know.

Answer


Regarding the fact that all of your processes are running on the same CPU, see my answer here.


During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool, all of the worker processes that it spawns will end up vying for the same core, rather than using all of the cores available on your machine.


You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:

import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool


def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':

    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    print timeit(lambda: map(mmul, matrices), number=20)

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all cores
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)
    print timeit(lambda: pool.map(mmul, matrices), number=20)

Output:

    $ python tmp.py                                     
    12.4765810966
    pid 29150's current affinity mask: 1
    pid 29150's new affinity mask: ff
    13.4136221409
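As a side note (not part of the original answer): on Python 3.3+ the same reset can be done from inside the script with `os.sched_setaffinity`, without shelling out to `taskset` (Linux-only API):

```python
import os

# Inspect which cores this process is currently allowed to run on.
print("current affinity:", sorted(os.sched_getaffinity(0)))

# Widen the mask to all cores the OS reports, the in-Python analogue
# of `taskset -p 0xff <pid>` used above (Linux-only).
os.sched_setaffinity(0, range(os.cpu_count()))
print("new affinity:", sorted(os.sched_getaffinity(0)))
```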


If you watch CPU usage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation, etc. probably outweighs any possible benefit from parallelisation.
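That serialization cost is easy to measure directly: each 1000x1000 float64 array is about 8 MB that has to be pickled out to a worker and pickled back (a small measurement sketch in Python 3 syntax, not from the original answer):

```python
import pickle
import time
import numpy as np

matrix = np.random.randn(1000, 1000)   # ~8 MB of float64 data

start = time.perf_counter()
payload = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.perf_counter() - start

print("pickled size: %.1f MB" % (len(payload) / 1e6))
print("pickle time:  %.4f s" % elapsed)
```

Pool.map pays this cost in both directions for every matrix, while the plain map touches the data in place.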


I suspect that part of the reason why the single process seems to be consistently faster is that numpy may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.
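How large those vectorization tricks are can be seen by timing the vectorized product against an explicit Python loop over the same data (a sketch in Python 3 syntax, added for illustration):

```python
import time
import numpy as np

matrix = np.random.randn(500, 500)

start = time.perf_counter()
vectorized = matrix * matrix           # element-wise product, vectorized in C
t_vec = time.perf_counter() - start

start = time.perf_counter()
looped = np.empty_like(matrix)
for i in range(matrix.shape[0]):       # same product, pure-Python double loop
    for j in range(matrix.shape[1]):
        looped[i, j] = matrix[i, j] * matrix[i, j]
t_loop = time.perf_counter() - start

print("vectorized: %.5f s, looped: %.5f s" % (t_vec, t_loop))
```

On a typical machine the loop is orders of magnitude slower, which is the headroom the single-process numpy version already exploits.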


For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:

import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool

def fib(dummy):
    n = [1,1]
    for ii in xrange(100000):
        n.append(n[-1]+n[-2])

def silly_mult(matrix):
    for row in matrix:
        for val in row:
            val * val

if __name__ == '__main__':

    dt = timeit(lambda: map(fib, xrange(10)), number=10)
    print "Fibonacci, non-parallel: %.3f" %dt

    matrices = [np.random.randn(1000,1000) for ii in xrange(10)]
    dt = timeit(lambda: map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, non-parallel: %.3f" %dt

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all CPUS
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)

    dt = timeit(lambda: pool.map(fib,xrange(10)), number=10)
    print "Fibonacci, parallel: %.3f" %dt

    dt = timeit(lambda: pool.map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, parallel: %.3f" %dt

Output:

$ python tmp.py
Fibonacci, non-parallel: 32.449
Silly matrix multiplication, non-parallel: 40.084
pid 29528's current affinity mask: 1
pid 29528's new affinity mask: ff
Fibonacci, parallel: 9.462
Silly matrix multiplication, parallel: 12.163

