Matlab在GPU上的SVD [英] Matlab SVD on GPU

查看:232
本文介绍了Matlab在GPU上的SVD的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Matlab R2014a测试SVD,似乎没有加速CPU与GPU。
我使用gtx 460和核心2 duo E8500。

I'm testing SVD in Matlab R2014a and it seems that there is not speedup CPU vs GPU. I use gtx 460 and core 2 duo E8500.

这是我的代码:

%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc

运行时间与运行时不同,但CPU和GPU版本,那么为什么没有加速?

Also run time different from run to run but CPU and GPU version about the same, so why there is no speedup?

这里是一些测试

for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n);
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n);
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.


for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n,'single');
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n,'single');
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.


推荐答案

一般来说,SVD是一个很难的paralellize例程。您可以点击 此处

Generally SVD is a difficult to paralellize routine. You can check here that with a high end Tesla card, the speedup is not very impressive.

你有一个GTX460卡 - Fermi architecture 。该卡针对游戏(单精度计算)而不是HPC(双精度计算)进行了优化。 单精度/双精度吞吐量比为12.因此,卡具有873 GFLOPS SP / 72 GFLOPS DP 。检查 此处

You have a GTX460 card - Fermi architecture. The card is optimized for gaming (single precision computations), not HPC (double precision computation). The Single Precision / Double Precision throughput ratio is 12. So the card has 873 GFLOPS SP / 72 GFLOPS DP. Check here.

所以如果Md数组使用双精度元素,那么对它的计算会相当慢。此外,调用CPU例程时,所有CPU内核都将被利用,从而降低了在GPU上运行例程的可能性。此外,在GPU中,您可以花费时间将缓冲区传输到设备。

So if the Md array uses double precision elements, then the computation on it would be rather slow. Also there's a high chance that when calling the CPU routine, all CPU cores will get utilized, reducing the possible gain of running the routine on the GPU. Plus, in the GPU run you pay time for transferring the buffer to the device.

根据Divakar的建议,您可以使用 Md = single(Md)将数组转换为单精度并再次运行基准。你可以尝试使用更大的数据集大小,看看是否有变化。

Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can try and go with a bigger dataset size to see if something changes. I don't expect to much gain for this routine on your GPU.

更新1:

发布结果后,我看到DP / SP时间比率是2.在CPU端这是正常的,因为你可以适应2倍少 double 值。然而,在GPU一侧只有2的比例意味着gpu代码不能最好地利用SM核心 - 因为理论比率为12.换句话说,我会预期更好的SP性能优化的代码,与DP 相比。看起来不是这样的。

After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because you can fit 2 times less double values in SSE registers. However, a ratio of only 2 on the GPU side means that the gpu code does not make best use of the SM cores - because the theoretical ratio is 12. In other words, I would have expected much better SP performance for an optimized code, compared to DP. It seems that this is not the case.

这篇关于Matlab在GPU上的SVD的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆