Matlab is slow when using user defined function with calculation in GPU
Problem description
When I run the code shown below, the tic/toc pair inside the function shows that it takes a very short time (<< 1 sec) to go through all the lines. However, it actually takes around 2.3 seconds to get the outputs! I use the tic/toc pair to measure the time.
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All the matrices in rnn and inData are gpuArray, so all the calculations are carried out on the GPU. The outputs are also gpuArray.
function [H,OX] = forward_pass(rnn, inData)
    tic;
    % initial hidden state values
    H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
    % initialize state H
    H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
    % initialize OX (which is H * W_ho)
    OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
    for t = 1 : inData.TimeSteps
        if t == 1
            HX_t = H_init * rnn.W_hh ...
                 + inData.V(:,:,t) * rnn.W_vh;
        else
            HX_t = H(:,:,t-1) * rnn.W_hh ...
                 + inData.V(:,:,t) * rnn.W_vh;
        end
        H(:,:,t) = tanh(HX_t);
        OX(:,:,t) = H(:,:,t) * rnn.W_ho;
    end
    toc;
end
Normally it is slow if you use the gather() function. I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line, "end", takes more than 2 seconds.
Does anyone know how to accelerate the function call?
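For readers who want to reproduce the computation without MATLAB or a GPU, here is a NumPy sketch of the same per-timestep forward pass (not from the post; the dict keys simply mirror the struct fields above, and the small sizes are arbitrary):

```python
import numpy as np

def forward_pass(rnn, V):
    """Per-timestep RNN forward pass, mirroring the MATLAB loop above.

    V has shape (batch, v, timesteps); weight names follow the question.
    """
    batch, _, timesteps = V.shape
    H = np.zeros((batch, rnn["h"], timesteps))
    OX = np.zeros((batch, rnn["o"], timesteps))
    h_prev = np.tile(rnn["h_init"], (batch, 1))  # H_init = repmat(h_init, batch, 1)
    for t in range(timesteps):
        hx = h_prev @ rnn["W_hh"] + V[:, :, t] @ rnn["W_vh"]
        H[:, :, t] = np.tanh(hx)
        OX[:, :, t] = H[:, :, t] @ rnn["W_ho"]
        h_prev = H[:, :, t]
    return H, OX

rng = np.random.default_rng(0)
rnn = {"v": 11, "h": 5, "o": 7}
rnn["h_init"] = rng.standard_normal((1, rnn["h"]))
rnn["W_vh"] = rng.standard_normal((rnn["v"], rnn["h"]))
rnn["W_hh"] = rng.standard_normal((rnn["h"], rnn["h"]))
rnn["W_ho"] = rng.standard_normal((rnn["h"], rnn["o"]))
V = rng.standard_normal((10, rnn["v"], 20))
H, OX = forward_pass(rnn, V)
```

Note that two matrix multiplications are launched per timestep, which is what the accepted answer below reduces.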
First off, for proper benchmarking you do need to use gather, either inside the function call or afterwards. In the former case the function call returns a non-gpu output; in the latter case, a gpu-based datatype is the output. This matters because gpuArray operations are launched asynchronously, so a tic/toc around the launches returns almost immediately, while gather (or wait(gpuDevice)) blocks until the GPU has actually finished - which is why your inner toc looks fast but the total time is not. Now, back to your problem: you are using very few TimeSteps, so any optimization you try won't show up in a big way. Here's an optimized version whose benefit grows as you increase TimeSteps -
function [H,OX] = forward_pass(rnn, inData)
    H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
    T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2)) * rnn.W_vh;
    H(:,:,1) = tanh(bsxfun(@plus, rnn.h_init * rnn.W_hh, T(1:size(inData.V,1),:)));
    for t = 2 : inData.TimeSteps
        H(:,:,t) = tanh( H(:,:,t-1) * rnn.W_hh + ...
            T((t-1)*size(inData.V,1)+1 : t*size(inData.V,1),:) );
    end
    A = reshape(permute(H,[1 3 2]),[],size(H,2)) * rnn.W_ho;
    OX = permute(reshape(A, size(H,1), size(A,1)/size(H,1), []), [1 3 2]);
    return;
end
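The trick above is to batch all the input projections into one big matmul up front (T) and all the output projections into one big matmul at the end (A), leaving only the unavoidable recurrent multiply inside the loop. The following NumPy sketch (my own translation, not from the answer; note NumPy is row-major, so the transpose/reshape order differs from MATLAB's column-major reshape) shows the same restructuring and checks it against a per-timestep loop:

```python
import numpy as np

def forward_pass_vectorized(rnn, V):
    """Vectorized forward pass: one matmul for all input projections,
    one for all output projections, matching the answer's restructuring."""
    batch, v, timesteps = V.shape
    h = rnn["h"]
    # Block t of T occupies rows t*batch:(t+1)*batch, which corresponds to
    # T((t-1)*n+1 : t*n, :) in the MATLAB version.
    T = np.transpose(V, (2, 0, 1)).reshape(-1, v) @ rnn["W_vh"]
    H = np.zeros((batch, h, timesteps))
    # (1,h) row broadcasts against (batch,h), like bsxfun(@plus, ...).
    H[:, :, 0] = np.tanh(rnn["h_init"] @ rnn["W_hh"] + T[:batch])
    for t in range(1, timesteps):
        H[:, :, t] = np.tanh(H[:, :, t - 1] @ rnn["W_hh"]
                             + T[t * batch:(t + 1) * batch])
    # All output projections at once, then fold back to (batch, o, timesteps).
    A = np.transpose(H, (2, 0, 1)).reshape(-1, h) @ rnn["W_ho"]
    OX = np.transpose(A.reshape(timesteps, batch, -1), (1, 2, 0))
    return H, OX

rng = np.random.default_rng(1)
rnn = {"h": 5, "o": 7}
rnn["h_init"] = rng.standard_normal((1, rnn["h"]))
rnn["W_vh"] = rng.standard_normal((11, rnn["h"]))
rnn["W_hh"] = rng.standard_normal((rnn["h"], rnn["h"]))
rnn["W_ho"] = rng.standard_normal((rnn["h"], rnn["o"]))
V = rng.standard_normal((8, 11, 30))
H, OX = forward_pass_vectorized(rnn, V)

# Check against a straightforward per-timestep loop.
h_prev = np.tile(rnn["h_init"], (8, 1))
for t in range(30):
    h_prev = np.tanh(h_prev @ rnn["W_hh"] + V[:, :, t] @ rnn["W_vh"])
    assert np.allclose(H[:, :, t], h_prev)
    assert np.allclose(OX[:, :, t], h_prev @ rnn["W_ho"])
```

On a GPU the payoff is that two large matmuls replace 2*TimeSteps small kernel launches, which is exactly why the gap widens as TimeSteps grows in the benchmarks below.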
Benchmarking
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (the rest are the same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
Please note that these were tested on a GTX 750 Ti.