Numbapro: No speed-up for Matrix Multiplication
Problem description
For the last couple of days I've been trying to understand why NumbaPro (Accelerate from Continuum Analytics, Inc.; I'm running a 30-day trial version) does not accelerate matrix multiplication on my MacBook Pro (Intel Core i7, 2.6 GHz, 16 GB RAM, NVIDIA GeForce GT 650M with 1 GB on the PCI bus).
I took one of the example codes for (NxM)x(MxN) matrix multiplication, where Continuum Analytics, Inc. claims the computation is accelerated via CUDA, and compared the timings of cuda.jit and numpy. My idea is to run e.g. 1e4 iterations, with matrix B randomised on every iteration. The code I used is below, followed by the times I obtained. Is there any solution for this? Thanks!
from numbapro import *
from numba import *
import numpy as np
import math
from timeit import default_timer as timer

m=1000
n=1000
A = np.array(np.random.random((n,m)), dtype=np.float32)
C = np.empty([n,n])
iterations = 10000

start = timer()
for i in range(iterations):
    B = np.array(np.random.random((m,n)), dtype=np.float32)
    X=np.dot(A,B)
numpy_time=(timer() - start)

@cuda.jit(void(float32[:,:],float32[:,:],float32[:,:]))
def cu_square_matrix_mul(A, B, C):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    x = tx + bx * bw
    y = ty + by * bh
    n = C.shape[0]

    if x >= n or y >= n:
        return

    cs = 0
    for i in range(n):
        cs += A[y,i]*B[i,x]
    C[y,x]= cs

    cuda.syncthreads()

blockdim = 256,3
griddim = 10,3

stream = cuda.stream()
dA = cuda.to_device(A, stream)
dC = cuda.to_device(C, stream)

start = timer()
for i in range(iterations):
    B = np.array(np.random.random((m,n)), dtype=np.float32)
    dB = cuda.to_device(B, stream)
    cu_square_matrix_mul[griddim,blockdim,stream](dA, dB, dC)
    dC.to_host()
    stream.synchronize()
cuda_time = (timer() - start)

print
print("Numpy took %f seconds" % numpy_time)
print("CUDA JIT took %f seconds, %.5fx speedup" % (cuda_time, numpy_time / cuda_time))

This results in:
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 30 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 30 days
Vendor: Continuum Analytics, Inc.
Package: numbapro
Message: trial mode expires in 30 days
Numpy took 378.328881 seconds
CUDA JIT took 342.723757 seconds, 1.10389x speedup
Recommended answer
This is a completely naive matrix multiplication routine on the GPU, whereas the numpy routine, being effectively a library call:
X=np.dot(A,B)
is likely to be highly optimized. I'm impressed that the GPU is faster at all.
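To make the contrast concrete, here is a minimal CPU-only sketch of the same per-element loop that the kernel in the question performs, next to the library call. The helper name naive_matmul and the small 8x5 and 5x8 test matrices are assumptions chosen for illustration; they are not part of NumbaPro or of the original code.

```python
import numpy as np

def naive_matmul(A, B):
    """Hypothetical helper: one dot product per output element, like the kernel body."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p), dtype=A.dtype)
    for y in range(n):            # row of C (the kernel's y index)
        for x in range(p):        # column of C (the kernel's x index)
            cs = 0.0
            for i in range(m):    # accumulate one element of C
                cs += A[y, i] * B[i, x]
            C[y, x] = cs
    return C

rng = np.random.default_rng(0)
A = rng.random((8, 5)).astype(np.float32)
B = rng.random((5, 8)).astype(np.float32)

C_naive = naive_matmul(A, B)
C_lib = np.dot(A, B)              # the library routine used in the question
same = np.allclose(C_naive, C_lib, rtol=1e-5)
print(same)
```

Both paths compute the same product; the difference lies in how the work is scheduled. A tuned BLAS typically blocks the loops for cache reuse and uses vector instructions, which the straightforward loop does not.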
The "solution" would be to make a call to CUBLAS for the matrix multiplication, rather than writing your own kernel.
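Whichever route is taken, the GPU timing is only comparable to np.dot if the launch actually computes every element of C. The following is a minimal sketch of that check in plain Python, using the question's values blockdim = 256,3 and griddim = 10,3. The helper names covered_extent and grid_for and the 16x16 block shape are illustrative assumptions, not NumbaPro API.

```python
import math

def covered_extent(griddim, blockdim):
    """Hypothetical helper: how many distinct x and y indices a 2D launch produces."""
    return griddim[0] * blockdim[0], griddim[1] * blockdim[1]

def grid_for(n, blockdim):
    """Hypothetical helper: smallest 2D grid that covers an n x n output."""
    return math.ceil(n / blockdim[0]), math.ceil(n / blockdim[1])

n = 1000

# Launch configuration from the question
x_extent, y_extent = covered_extent((10, 3), (256, 3))
print(x_extent, y_extent)   # 2560 9

# One possible configuration that covers all n x n elements (assumed block shape)
blockdim = (16, 16)
griddim = grid_for(n, blockdim)
print(griddim)              # (63, 63)
```

With the question's values, y = ty + by * bh only ranges over 0 to 8, so the kernel as launched appears to fill just the first 9 of the 1000 rows of C. If so, the two timings measure different amounts of work, and the launch configuration would need to cover the full matrix before the speedup figure means much.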