GPU optimization for vectorized code


Problem description


function w=oja(X, varargin)

% get the dimensionality
[m n] = size(X);

% random initial weights
w = randn(m,1);

options = struct( ...
    'rate', .00005, ...
    'niter', 5000, ...
    'delta', .0001);
options = getopt(options, varargin); % getopt (not built-in): presumably merges name/value overrides into the defaults
success = 0;

% run through all input samples
for iter = 1:options.niter
    y = w'*X;
    for ii = 1:n       
        % y(ii) is a scalar (y itself is a 1-by-n row vector)
        w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
    end
end
if (any(~isfinite(w)))
    warning('Lost convergence; lower learning rate?');
end

end

size(X) = 400 153600

This code implements Oja's rule and runs slowly. I am not able to vectorize it any further. To make it run faster I wanted to do the computations on the GPU, so I changed

X=gpuArray(X)

But the code instead ran slower. The computations used seem to be GPU-compatible. Please let me know what my mistake is.

Profiler output:

Complete details: https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing

Solution

This is not a full answer on how to solve the problem, but rather an explanation of why the GPU does not speed your code up and in fact slows it down enormously.

GPUs are fantastic at speeding up parallel code, meaning they can do A LOT of things at the same time (e.g. my GPU can do 30070 operations at once, while a modern CPU can't go past 16). However, individual GPU cores are very slow! Nowadays a decent CPU runs at around 2-3 GHz, while a modern GPU runs at around 700 MHz. This means a single CPU core is much faster than a single GPU core, but because GPUs can do so many things at the same time they can win overall.
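
To make the trade-off concrete, here is a minimal sketch (assuming the Parallel Computing Toolbox is installed; the timings are illustrative and hardware-dependent) that times one large element-wise operation on both devices. This is the kind of bulk, independent work a GPU is built for:

% Sketch: one big element-wise operation, CPU vs GPU.
% Every element can be processed independently, so thousands of slow
% GPU cores can beat a handful of fast CPU cores here.
A = randn(2000);                       % 4 million elements
tCPU = timeit(@() A .* A + A);         % CPU timing
G = gpuArray(A);
tGPU = gputimeit(@() G .* G + G);      % GPU timing
fprintf('bulk element-wise op: CPU %.4f s, GPU %.4f s\n', tCPU, tGPU);

The GPU's advantage disappears the moment the work cannot be spread over many cores at once, which is exactly the situation below.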

I once saw it explained like this: which would you prefer, a million-dollar sports car or a scooter? A million-dollar car or a thousand scooters? And what if your job is delivering pizza? Hopefully you answered a thousand scooters for that last one (unless you are a scooter fan and answered scooters for all of them, but that's not the point). (source, and a good introduction to GPUs)

Back to your code: your code is incredibly sequential. Every inner iteration depends on the previous one, and the same goes for the outer iterations. You cannot run two of them in parallel, because you need the result of one iteration to run the next. This means you will not get a new pizza order until you have delivered the last one, so what you want is to deliver them one by one, as fast as you can (so the sports car is better!).
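
Written out, the inner loop (with y fixed at the start of each pass, exactly as in the code) is the recurrence

$$w^{(i+1)} = w^{(i)} + \eta \left( y_i \, x_i - y_i^{2} \, w^{(i)} \right), \qquad x_i = X(:,i), \quad y_i = y(i), \quad \eta = \text{options.rate}$$

Each update reads the w produced by the previous one, so the 153600 inner updates of every one of the 5000 outer passes form a single strict chain: there is no independent work to hand to the GPU's cores.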

And actually, each of these one-line updates is incredibly fast! If I run 50 outer iterations on my computer, that line accounts for 13.034 seconds, which is about 1.69 microseconds per call (7,680,000 calls).
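
A sketch of how this kind of measurement can be reproduced (50 outer iterations instead of 5000; the exact numbers will differ from machine to machine):

% Sketch: time the update line over 50 outer passes (not the full 5000).
m = 400; n = 153600;
X = randn(m, n);
w = randn(m, 1);
rate = 5e-5;
tic;
for iter = 1:50
    y = w' * X;
    for ii = 1:n
        w = w + rate * (y(ii) * X(:, ii) - y(ii)^2 * w);
    end
end
t = toc;
fprintf('total %.3f s, %.2f us per update (%d calls)\n', ...
    t, 1e6 * t / (50 * n), 50 * n);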

So your problem is not that your code is slow; it is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for this kind of work.

So, unfortunately, GPUs are terrible at sequential code, and your code is very sequential, therefore you cannot use a GPU to speed it up. An HPC cluster will not help either, because every loop iteration depends on the previous one (no parfor :( ).
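
For illustration, this is why parfor is ruled out: the update both reads and writes w, so it is not a pure reduction (like w = w + c with c independent of w), and MATLAB's parfor analysis rejects the loop. Replacing the inner for with parfor in the timing sketch above gives a classification error (the exact message depends on the release):

parfor ii = 1:n
    % Error: w cannot be classified, because the right-hand side
    % depends on the w produced by the previous iteration.
    w = w + rate * (y(ii) * X(:, ii) - y(ii)^2 * w);
end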

So, as far as I can tell, you will need to deal with it.
