Why is MATLAB gpuArray much slower at just adding two matrices?


Question

I have recently started using MATLAB's CUDA (gpuArray) support for some very simple matrix calculations on the GPU, but the performance results are very strange. Could anybody help me understand what exactly is going on and how I can solve the issue? Thanks in advance. Please note that the following code was run on a GeForce GTX TITAN Black GPU.

Assume a0, a1, ..., a6 are 1000*1000 gpuArrays, U = 0.5 and V = 0.0.

titan = gpuDevice();
tic();

for i=1:10000
a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));  
end

wait(titan);
time = toc()

Elapsed time: 17.98 seconds.

Now, redefining a0, a1, ..., a6 as well as U and V as ordinary CPU arrays and measuring the time needed:

tic();

for i=1:10000
a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));  
end

time= toc()  

Elapsed time: 0.0098 seconds.

So the CPU is more than 1800 times faster!

Then I decided to do the same calculation on the whole matrices rather than on specific elements, and here are the results.

Results for the run on the GPU:

titan = gpuDevice();
tic();
for i=1:10000
a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end
wait(titan);
time = toc()   

Elapsed time: 6.32 seconds, which means that the operation on the whole matrix is much faster than on a single element!

Results for the run on the CPU:

tic();
for i=1:10000
a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end

time= toc()  

Elapsed time: 35.2 seconds.

And here is the most surprising result: assuming a0, a1, ..., a6 as well as U and V are just 1*1 gpuArrays, and running the following:

titan = gpuDevice();
tic();
for i=1:10000
a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);  
end
wait(titan);
time = toc()  

Elapsed time: 7.8 seconds.

It is even slower than the corresponding 1000*1000 case!

Unfortunately, the line a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1)); is one of about 100 lines, all inside a single for-loop, and it has proved to be a real bottleneck, taking about 50% of the total calculation time. Could anybody help me? Note that moving this part of the calculation to the CPU is not an option: the bottleneck line is inside the for-loop, and sending a1, ..., a6 to the CPU and bringing the result back to the GPU in each iteration would be even more time-consuming. Any advice is really appreciated.

Recommended answer

I think your second GPU result (the vectorised GPU call) is the most pertinent one: GPUs are most efficient when operating on large amounts of data in a vectorised fashion. In your case, you can probably get even better performance by converting your expression into an arrayfun call. arrayfun allows MATLAB to convert the entire expression into a single operation on the GPU, which takes best advantage of the (huge) available memory bandwidth of the device.
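As a rough sketch of what such an arrayfun call might look like (the function names updateA6 and elementUpdate are made up for illustration, and this assumes the Parallel Computing Toolbox with a supported GPU), the expression can be written as a scalar element-wise function and applied to the gpuArrays. U and V are passed in as scalar inputs, which the GPU version of arrayfun expands to match the array inputs. Whether this actually beats the plain vectorised expression depends on the hardware and MATLAB release, so it is worth timing both.

```matlab
% Save as updateA6.m. Hypothetical usage, with a0..a5 as same-sized gpuArrays:
%   a6 = updateA6(a0, a1, a2, a3, a4, a5, 0.5, 0.0);
function a6 = updateA6(a0, a1, a2, a3, a4, a5, U, V)
% Apply the element-wise update in a single arrayfun call on the GPU.
% U and V are scalars; arrayfun on gpuArrays expands scalar inputs.
a6 = arrayfun(@elementUpdate, a0, a1, a2, a3, a4, a5, U, V);
end

function r = elementUpdate(x0, x1, x2, x3, x4, x5, u, v)
% Scalar form of the expression from the question: every argument is a
% single element here, so plain * and / are sufficient.
r = 0.5*(x5 - x0) - (x1 + x2 + x3) - x5*u/3.0 - x5*v/2.0 + 0.25*x5*x4;
end
```

The first call includes some one-off setup cost for running the function on the GPU, so any timing comparison should ignore the first iteration.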

As for your problem calculating a6(1,1): it might be best to calculate the whole array (that is, do not index the right-hand-side expressions) and then index afterwards. Something like:

tmp = (0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);
a6(1,1) = tmp(1,1);
