numba - guvectorize barely faster than jit

Question

I was trying to parallelize a Monte Carlo simulation that operates on many independent datasets. I found out that numba's parallel guvectorize implementation was barely 30-40% faster than the numba jit implementation.

I found these (1, 2) comparable topics on Stack Overflow, but they do not really answer my question. In the first case, the implementation is slowed down by a fallback to object mode, and in the second case the original poster did not use guvectorize properly - neither of these problems applies to my code.

To make sure there was no problem with my code, I created this very simple piece of code to compare jit to guvectorize:

import timeit
import numpy as np
from numba import jit, guvectorize

#both functions take an (m x n) array as input, compute the row sum, and return the row sums in a (m x 1) array

@guvectorize(["void(float64[:], float64[:])"], "(n) -> ()", target="parallel", nopython=True)
def row_sum_gu(input, output):
    output[0] = np.sum(input)

@jit(nopython=True)
def row_sum_jit(input_array, output_array):
    m, n = input_array.shape
    for i in range(m):
        output_array[i] = np.sum(input_array[i,:])

rows = int(64) #broadcasting (= supposed parallelization) dimension for guvectorize
columns = int(1e6)
input_array = np.ones((rows, columns))
output_array = np.zeros((rows))
output_array2 = np.zeros((rows))

#the first run includes the compile time
row_sum_jit(input_array, output_array)
row_sum_gu(input_array, output_array2)

#run each function 100 times and record the time
print("jit time:", timeit.timeit("row_sum_jit(input_array, output_array)", "from __main__ import row_sum_jit, input_array, output_array", number=100))
print("guvectorize time:", timeit.timeit("row_sum_gu(input_array, output_array2)", "from __main__ import row_sum_gu, input_array, output_array2", number=100))

This gives me the following output (the times do vary a bit):

jit time: 12.04114792868495
guvectorize time: 5.415564753115177

Thus again, the parallel code is barely two times faster (only when the number of rows is an integer multiple of the number of CPU cores; otherwise the performance advantage diminishes), even though it utilizes all CPU cores while the jit code only uses one (verified using htop).
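To illustrate the "integer multiple of the core count" point, the broadcast dimension can be rounded up to the next multiple of the number of cores so every core gets an equal share of rows (this padding helper is just an illustration, not part of my actual simulation code):

```python
import os

import numpy as np

cores = os.cpu_count() or 1

def pad_rows(rows, cores):
    # Round the broadcast dimension up to the next multiple of the core count,
    # so each core processes the same number of rows.
    return -(-rows // cores) * cores  # ceiling division

rows = pad_rows(60, cores)
work = np.ones((rows, 1000))  # padded rows can be ignored afterwards
assert rows % cores == 0
```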

I am running this on a machine with 4x AMD Opteron 6380 CPU (so 64 cores in total), 256 GB of RAM and Red Hat 4.4.7-1 OS. I use Anaconda 4.2.0 with Python 3.5.2 and Numba 0.26.0.

How can I further improve the parallel performance or what am I doing wrong?

Thank you in advance for your answers.

Answer

That's because np.sum is too simple. Summing an array is limited not only by the CPU but also by memory-access time. So throwing more cores at it doesn't make much of a difference (of course, that depends on how fast memory access is relative to your CPU).

Just for visualization, np.sum works something like this (ignoring any parameters other than the data):

def sum(data):
    sum_ = 0.
    data = data.ravel()
    for i in range(data.size):
        item = data[i]   # memory access (I/O bound)
        sum_ += item     # addition      (CPU bound)
    return sum_

So if most of the time is spent accessing memory, you won't see any real speedup from parallelizing it. However, if the CPU-bound part is the bottleneck, then using more cores will speed up your code significantly.

For example, if you include some operations that are slower than addition, you'll see a bigger improvement:

from math import sqrt
from numba import njit, jit, guvectorize
import timeit
import numpy as np

@njit
def square_sum(arr):
    a = 0.
    for i in range(arr.size):
        a = sqrt(a**2 + arr[i]**2)  # sqrt and square are cpu-intensive!
    return a

@guvectorize(["void(float64[:], float64[:])"], "(n) -> ()", target="parallel", nopython=True)
def row_sum_gu(input, output):
    output[0] = square_sum(input)

@jit(nopython=True)
def row_sum_jit(input_array, output_array):
    m, n = input_array.shape
    for i in range(m):
        output_array[i] = square_sum(input_array[i,:])
    return output_array
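As a side note, the running sqrt(a**2 + x**2) recurrence in square_sum is mathematically just the Euclidean (L2) norm of the array, so its result can be checked against np.linalg.norm (this reference check is my addition, not part of the original code):

```python
import numpy as np

def square_sum_ref(arr):
    # Same recurrence as the jitted square_sum: a running sqrt(a**2 + x**2),
    # which accumulates the L2 norm of the whole array.
    a = 0.0
    for x in arr:
        a = (a * a + x * x) ** 0.5
    return a

x = np.array([3.0, 4.0, 12.0])
assert abs(square_sum_ref(x) - np.linalg.norm(x)) < 1e-12  # both are 13.0
```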

I used IPython's %timeit here, but it should be equivalent:

rows = 64
columns = int(1e6)

input_array = np.random.random((rows, columns))
output_array = np.zeros((rows))
output_array2 = np.zeros((rows))

# Warm up and check that both implementations give the same result
np.testing.assert_equal(row_sum_jit(input_array, output_array), row_sum_gu(input_array, output_array2))
%timeit row_sum_jit(input_array, output_array.copy())  # 10 loops, best of 3: 130 ms per loop
%timeit row_sum_gu(input_array, output_array.copy())   # 10 loops, best of 3: 35.7 ms per loop

I'm only using 4 cores, so that's pretty close to the limit of the possible speedup!

Just remember that parallel computation can only significantly speed up your calculation if the job is CPU-bound.
