GPU实施中的顺序操作 [英] Sequential operation in GPU implementation

查看:231
本文介绍了GPU实施中的顺序操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我必须在GPU中实现以下算法

for(int I = 0; I < 1000; I++){
    VAR1[I+1] = VAR1[I] + VAR2[2*K+(I-1)];//K is a constant
}

每个迭代都依赖于以前的迭代,因此很难并行化.我不确定原子操作在这里是否有效.我该怎么办?

VAR1 VAR2 都是一维数组.

VAR1[0] = 1

解决方案

这属于称为重复关系.根据递归关系的结构,可能存在封闭形式的解决方案,这些解决方案描述了如何分别(即并行,无递归)计算每个元素.早期的开创性论文(关于并行计算)之一是简短教程为此提供了更多处理想法.

在您的情况下,让我们看看是否可以通过映射VAR1的前几个术语看起来像什么,然后将以前的术语替换为较新的术语来发现任何东西:

i      VAR1[i]
___________________
0        1
1        1 + VAR2[2K-1]
2        1 + VAR2[2K-1] + VAR2[2K]
3        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]
4        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]
...

希望您能想到的是,上述VAR2[]术语遵循前缀总和的格式.

这意味着可以通过以下方式给出一种可能的解决方法:

VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)])   (for i > 0) notes:(1) (2)
VAR1[i] = 1                                (for i = 0)

现在,前缀总和可以并行完成(这不是真正的完全独立的操作,但是可以并行化.在这里,我不想对术语或纯度争论太多.我提供编写自己的. >

注意:

  1. 使用-1或-2作为i的前缀和的偏移量可能取决于您使用inclusiveexclusive扫描或前缀和操作.

  2. 必须在适当的域上定义
  3. VAR2才能使其合理.但是,该要求在您的问题陈述中是隐含的.

这是一个琐碎的例子.在这种情况下,由于VAR2索引项2K+(I-1)只是表示I(2K-1)的固定偏移量,因此我们仅将0偏移量用于演示目的,因此VAR2只是一个简单的数组在与VAR1相同的域中.为了演示,我将VAR2定义为所有1的数组. gpu并行计算发生在VAR1向量中,CPU等效计算只是在cpu变量中即时进行计算以用于验证:

$ cat t1056.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>

const int dsize = 1000;
using namespace thrust::placeholders;
int main(){

  thrust::device_vector<int> VAR2(dsize, 1);  // initialize VAR2 array to all 1's
  thrust::device_vector<int> VAR1(dsize);
  thrust::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0); // put prefix sum of VAR2 into VAR1
  thrust::transform(VAR1.begin(), VAR1.end(), VAR1.begin(),  _1 += 1);   // add 1 to every term
  int cpu = 1;
  for (int i = 1; i < dsize; i++){
    int gpu = VAR1[i];
    cpu += VAR2[i];
    if (cpu != gpu) {std::cout << "mismatch at: " << i << " was: " << gpu << " should be: " << cpu << std::endl; return 1;}
    }
  std::cout << "Success!" << std::endl;
  return 0;
}

$ nvcc -o t1056 t1056.cu
$ ./t1056
Success!
$

有关使用扫描操作解决线性递归问题的其他参考,请参阅Blelloch的论文 解决方案

This is in a category of problems called recurrence relations. Depending on the structure of the recurrence relation, there may exist closed form solutions that describe how to compute each element individually (i.e. in parallel, without recursion). One of the early seminal papers (on parallel computation) was Kogge and Stone, and there exist recipes and strategies for parallelizing specific forms.

Sometimes recurrence relations are so simple that we can identify a closed-form formula or algorithm with a little bit of "inspection". This short tutorial gives a little bit more treatment of this idea.

In your case, let's see if we can spot anything just by mapping out what the first few terms of VAR1 should look like, substituting previous terms into newer terms:

i      VAR1[i]
___________________
0        1
1        1 + VAR2[2K-1]
2        1 + VAR2[2K-1] + VAR2[2K]
3        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]
4        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]
...

Hopefully what jumps out at you is that the VAR2[] terms above follow a pattern of a prefix sum.

This means one possible solution method could be given by:

VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)])   (for i > 0) notes:(1) (2)
VAR1[i] = 1                                (for i = 0)

Now, a prefix sum can be done in parallel (this is not truly a fully independent operation, but it can be parallelized. I don't want to argue too much about terminology or purity here. I'm offering one possible method of parallelization for your stated problem, not the only way to do it.) To do a prefix sum in parallel on the GPU, I would use a library like CUB or Thrust. Or you can write your own although I wouldn't recommend it.

Notes:

  1. the use of -1 or -2 as an offset to i for the prefix sum may be dictated by your use of an inclusive or exclusive scan or prefix sum operation.

  2. VAR2 must be defined over an appropriate domain to make this sensible. However that requirement is implicit in your problem statement.

Here is a trivial worked example. In this case, since the VAR2 indexing term 2K+(I-1) just represents a fixed offset to I (2K-1), we are simply using an offset of 0 for demonstration purposes, so VAR2 is just a simple array over the same domain as VAR1. And I am defining VAR2 to just be an array of all 1, for demonstration purposes. The gpu parallel computation occurs in the VAR1 vector, the CPU equivalent computation is just computed on-the-fly in the cpu variable for validation purposes:

$ cat t1056.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>

const int dsize = 1000;
using namespace thrust::placeholders;
int main(){

  thrust::device_vector<int> VAR2(dsize, 1);  // initialize VAR2 array to all 1's
  thrust::device_vector<int> VAR1(dsize);
  thrust::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0); // put prefix sum of VAR2 into VAR1
  thrust::transform(VAR1.begin(), VAR1.end(), VAR1.begin(),  _1 += 1);   // add 1 to every term
  int cpu = 1;
  for (int i = 1; i < dsize; i++){
    int gpu = VAR1[i];
    cpu += VAR2[i];
    if (cpu != gpu) {std::cout << "mismatch at: " << i << " was: " << gpu << " should be: " << cpu << std::endl; return 1;}
    }
  std::cout << "Success!" << std::endl;
  return 0;
}

$ nvcc -o t1056 t1056.cu
$ ./t1056
Success!
$

For an additional reference particular to the usage of scan operations to solve linear recurrence problems, refer to Blelloch's paper here section 1.4.

这篇关于GPU实施中的顺序操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆