Matlab is slow when using user defined function with calculation in GPU


Problem Description

When I run the code shown below, the tic/toc pair inside the function shows that it takes a very short time (well under 1 second) to go through all the lines. However, it actually takes around 2.3 seconds to get the outputs! I use the tic/toc pair to measure the time.

tic

rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');

inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps = 100;
inData.BatchSize = 10000;

[H,OX] = forward_pass(rnn, inData)
toc

All the matrices in rnn and inData are gpuArrays, so all the calculations are carried out on the GPU. The outputs are also gpuArrays.

function [H,OX] = forward_pass(rnn, inData)
        tic;
        %initial hidden state values
        H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));

        %initialize state H
        H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');

        %initialize OX (which is H * Who)
        OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');

        for t = 1 : inData.TimeSteps

            if t == 1
                HX_t = H_init * rnn.W_hh... 
                        + inData.V(:,:,t) * rnn.W_vh;
            else
                HX_t = H(:,:,(t-1)) * rnn.W_hh... 
                        + inData.V(:,:,t) * rnn.W_vh;
            end

            H(:,:,t) = tanh(HX_t);
            OX(:,:,t) = H(:,:,t) * rnn.W_ho;


        end

        toc;
    end

Normally, it is slow if you use the gather() function. I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line, "end", takes more than 2 seconds.

Does anyone know how to accelerate the function call?

Solution

First off, for proper benchmarking you do need to use gather, either inside the function call or afterwards. In the former case, the function call would return a non-gpu output; in the latter case, the output would be a gpu-based datatype. Now, back to your problem: you are using very few TimeSteps, so any optimization you try won't show up in a big way. Here's an optimized version that shows increased performance as you increase TimeSteps (a short gather-based timing sketch follows it) -

function [H,OX] = forward_pass(rnn, inData)

% Pre-allocate the hidden states on the GPU
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');

% Input projection V*W_vh for all time steps in a single matrix product;
% rows (t-1)*BatchSize+1 : t*BatchSize of T correspond to time step t
T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;

% First time step: broadcast h_init*W_hh across the batch (no repmat needed)
H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));

% Remaining time steps: only the recurrent product stays inside the loop
for t = 2 : inData.TimeSteps
    H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
        T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
end

% Output projection H*W_ho for all time steps at once, then reshape back
% to BatchSize x rnn.o x TimeSteps
A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);

return;
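
To put the first point into practice, here is a minimal sketch of a gather-based timing and consistency check. It assumes rnn and inData are set up as in the question, and that the original implementation has been kept around under the name forward_pass_original (that name is not from the post, just an assumption for the comparison):

% Time the optimized version; gathering the outputs forces all queued
% GPU work to finish before toc reports the elapsed time.
tic
[H2,OX2] = forward_pass(rnn, inData);
H2  = gather(H2);
OX2 = gather(OX2);
toc

% Optional sanity check against the original implementation
% (assumed here to be saved as forward_pass_original).
[H1,OX1] = forward_pass_original(rnn, inData);
errH  = max(abs(gather(H1(:))  - H2(:)));
errOX = max(abs(gather(OX1(:)) - OX2(:)));
fprintf('max |dH| = %g, max |dOX| = %g\n', errH, errOX);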


Benchmarking

Test Case #1

Parameters

rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;

Results

---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.

Test Case #2

Parameters

inData.TimeSteps = 50000; (the rest are the same as in Test Case #1)

Results

---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.

Please note that these were tested on a GTX 750 Ti.
