使用bsxfun和GPU在Matlab中向量化嵌套循环 [英] Vectorizing nested loops in matlab using bsxfun and with GPU

查看:101
本文介绍了使用bsxfun和GPU在Matlab中向量化嵌套循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

For循环似乎非常慢,所以我想知道是否可以使用bsxfun对下所示代码中的嵌套循环进行矢量化,也许还可以引入GPU.

For loops seem to be extremely slow, so I was wondering if the nested loops in the code shown next could be vectorized using bsxfun and maybe GPU could be introduced too.

代码

%// Paramaters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Nested Loops - I 
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end

%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end

ndenom也可以视为常量.

Edit 1: n and denom could be assumed as constants too.

推荐答案

下面是矢量化的CPUGPU代码,我希望我至少对GPU代码和以后的基准测试使用良好的做法上.

Here are vectorized CPU and GPU codes and I am hoping that I am using at least good practices for the GPU code and the benchmarking later on.

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Calculate num/denom value for stage 1 and 2
nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom; %//'

%// Plug in the values in the output matrix
LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
LInc(n1+1:end,1:n1) = nd2.';  %//'
LInc(1:n1,n1+1:end) = nd2;

GPU代码

%// Pre-allocate for output
gLInc = zeros(n1+n2,n1+n2,'gpuArray');

%// Convert to gpu arrays
gL1 = gpuArray(L1);
gL2 = gpuArray(L2);

%// Calculate num/denom value for stage 1 and 2
nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom; %//'

%// Plug in the values in the output matrix
gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
gLInc(n1+1:end,1:n1) = nd2.';  %//'
gLInc(1:n1,n1+1:end) = nd2;

%// Gather data from GPU back to CPU
LInc = gather(gLInc);

基准化

GPU基准测试技巧摘自测量和改进GPU性能.

%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code

i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6;  %// Assumed

N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));

for k1 = 1:numel(N_arr)
    N = N_arr(k1);
    n1 = N;  %// n1, n2 are assumed identical for less-complicated benchmarking
    n2 = N;

    L1 = rand(n1,n1);
    L2 = rand(n2,j);

    f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
    timeall(1,k1) = timeit(f);
    clear f

    f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorzied CPU code
    timeall(2,k1) = timeit(f);
    clear f

    f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU(GTX 750Ti) code
    timeall(3,k1) = gputimeit(f);
    clear f
end

%// Display benchmark results
figure,hold on, grid on
plot(N_arr,timeall(1,:),'-b.')
plot(N_arr,timeall(2,:),'-ro')
plot(N_arr,timeall(3,:),'-kx')
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')

结果

结果表明,向量化的GPU代码在数据量更大的情况下性能确实很好,并且从比向量化的CPU和原始代码慢的速度变成了向量化的CPU代码的两倍.

Results show that the vectorized GPU code performs really well with higher datasize and goes from slower than both the vectorized CPU and original code to being twice as fast as the vectorized CPU code.

这篇关于使用bsxfun和GPU在Matlab中向量化嵌套循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆