Inefficient multiprocessing of numpy-based calculations

Problem Description

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:

import time
import numpy

from multiprocessing import Pool

def test_func(i):
    # Each worker builds two large arrays and repeatedly swaps their
    # contents via arithmetic; the argument i only labels the task.
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)

    for _ in range(2000):  # "_" instead of "i", which shadowed the parameter
        a = a + b
        b = a - b
        a = a - b

    return 1

# Time one run in the parent process.
t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()

# Time n_par identical runs dispatched to worker processes.
t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)

When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?

Some additional info which may be useful:

  • I'm using OSX 10.9.5, Python 3.4.2, and the CPU is a Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of the CPU time in total, so the system info may not be taking hyperthreading into account).

  • When I run this, I see n_par processes in top working at 100% CPU.

  • If I replace the numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4); a sketch of what that variant might look like follows this list.
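
A hypothetical reconstruction of that per-index variant (the original post does not give this code; test_func_indexed is an illustrative name, and the sizes are reduced so the pure-Python loops finish quickly):

def test_func_indexed(i, n=10000, n_iter=200):
    # Pure-Python per-index arithmetic is dominated by interpreter
    # overhead rather than memory bandwidth, so each worker is compute
    # bound and the work parallelizes across processes much better.
    a = list(numpy.random.normal(size=n))
    b = list(numpy.random.normal(size=n))
    for _ in range(n_iter):
        for j in range(n):
            s = a[j] + b[j]
            b[j] = s - b[j]  # b[j] becomes the old a[j]
            a[j] = s - b[j]  # a[j] becomes the old b[j]
    return 1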

Recommended Answer

It looks like the test function you're using is memory bound. That means the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b actually uses three arrays: a, b, and a new array that will replace a. Each of those arrays is about 8MB (1e6 floats * 8 bytes per float). I believe the various i7 models have something like 3MB - 8MB of shared L3 cache, so you cannot fit all three arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting for data to arrive from memory. And because the L3 cache and the memory bus are shared between the cores, spreading the work onto multiple cores gives no speedup: the processes just contend for the same memory bandwidth.
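
One partial mitigation within numpy itself: the out= argument of numpy's ufuncs avoids allocating a fresh 8MB temporary on every statement. A minimal sketch of the inner loop rewritten that way (this cuts allocation churn and one array's worth of traffic per statement, but does not remove the underlying bandwidth limit):

for _ in range(2000):
    numpy.add(a, b, out=a)       # a <- a + b, written in place
    numpy.subtract(a, b, out=b)  # b <- a - b, reads the updated a
    numpy.subtract(a, b, out=a)  # a <- a - b, completes the swap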

Memory-bound operations are an issue for numpy in general, and the only way I know to deal with them is to use something like Cython or Numba.
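
A minimal sketch of the Numba route, assuming numba is installed (swap_inplace is an illustrative name). Fusing the three statements into one compiled per-element loop keeps each element in a register across all three operations, so every pass reads and writes each array once instead of once per statement:

import numba
import numpy

@numba.njit
def swap_inplace(a, b, n_iter):
    # Compiled, fused per-element loop: one read and one write of each
    # array per pass, and no intermediate temporaries.
    for _ in range(n_iter):
        for j in range(a.shape[0]):
            s = a[j] + b[j]
            b[j] = s - b[j]  # old a[j]
            a[j] = s - b[j]  # old b[j]

a = numpy.random.normal(size=1000000)
b = numpy.random.normal(size=1000000)
swap_inplace(a, b, 2000)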
