GPU实现中的顺序操作 [英] Sequential operation in GPU implementation

查看:20
本文介绍了GPU实现中的顺序操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我必须在 GPU 中实现以下算法

for(int I = 0; I <1000; I++){VAR1[I+1] = VAR1[I] + VAR2[2*K+(I-1)];//K为常数}

每次迭代都依赖于前一次,因此并行化很困难.我不确定原子操作在这里是否有效.我能做什么?

VAR1VAR2 都是一维数组.

VAR1[0] = 1

解决方案

这属于称为 重复关系.根据递推关系的结构,可能存在封闭形式的解决方案,描述如何单独计算每个元素(即并行,无递归).早期的开创性论文之一(关于并行计算)是 Kogge 和 Stone,并且存在并行化特定形式的方法和策略.

有时递推关系是如此简单,以至于我们可以通过一点检查"来确定一个封闭形式的公式或算法.这个简短教程对此进行了更多处理想法.

在你的情况下,让我们看看我们是否可以通过绘制 VAR1 的前几个术语应该是什么样子来发现任何东西,将以前的术语替换为新的术语:

i VAR1[i]_____0 11 1 + VAR2[2K-1]2 1 + VAR2[2K-1] + VAR2[2K]3 1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]4 1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]...

希望上面的 VAR2[] 术语遵循 前缀总和.

这意味着可以通过以下方式给出一种可能的解决方法:

VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)]) (for i > 0) 注:(1) (2)VAR1[i] = 1(对于 i = 0)

现在,前缀求和可以并行完成(这不是真正的完全独立的操作,但可以并行化.我不想在这里过多争论术语或纯度.我提供一种可能的并行化方法,用于解决您提出的问题,而不是唯一的方法.)要在 GPU 上并行计算前缀和,我会使用像 CUB推力.或者您可以自己编写,尽管我不推荐这样做.>

注意事项:

  1. 使用 -1 或 -2 作为前缀和的 i 的偏移量可能取决于您使用的 inclusive排他扫描或前缀求和运算.

  2. VAR2 必须在适当的域上定义才能使其合理.但是,该要求隐含在您的问题陈述中.

这是一个简单的工作示例.在这种情况下,由于 VAR2 索引项 2K+(I-1) 仅表示对 I (2K-1),我们只是使用偏移量 0 进行演示,所以 VAR2 只是一个与 VAR1 位于同一域上的简单数组.为了演示目的,我将 VAR2 定义为所有 1 的数组.gpu 并行计算发生在 VAR1 向量中,CPU 等效计算只是在 cpu 变量中即时计算以进行验证:

$ cat t1056.cu#include #include #include #include <推力/transform.h>#include const int dsize = 1000;使用命名空间推力::占位符;int main(){推力::设备向量VAR2(dsize, 1);//将 VAR2 数组初始化为全 1推力::设备向量VAR1(dsize);推力::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0);//将 VAR2 的前缀和放入 VAR1推力::转换(VAR1.begin(), VAR1.end(), VAR1.begin(), _1 += 1);//每一项加1INT CPU = 1;for (int i = 1; i < dsize; i++){int gpu = VAR1[i];cpu += VAR2[i];if (cpu != gpu) {std::cout <<"不匹配:"<<我<<"是: "<<gpu<<"应该是:"<<处理器<

有关使用扫描操作解决线性递归问题的其他参考资料,请参阅 Blelloch 的论文 此处 1.4 节.这个问题/答案举例说明了如何实现该论文中的公式 1.5更一般的一阶递归情况.这个问题考虑二阶重复情况.>

I have to implement the following algorithm in GPU

for(int I = 0; I < 1000; I++){
    VAR1[I+1] = VAR1[I] + VAR2[2*K+(I-1)];//K is a constant
}

Each iteration is dependent on previous so the parallelizing is difficult. I am not sure if atomic operation is valid here. What can I do?

EDIT:

The VAR1 and VAR2 both are 1D array.

VAR1[0] = 1

解决方案

This is in a category of problems called recurrence relations. Depending on the structure of the recurrence relation, there may exist closed form solutions that describe how to compute each element individually (i.e. in parallel, without recursion). One of the early seminal papers (on parallel computation) was Kogge and Stone, and there exist recipes and strategies for parallelizing specific forms.

Sometimes recurrence relations are so simple that we can identify a closed-form formula or algorithm with a little bit of "inspection". This short tutorial gives a little bit more treatment of this idea.

In your case, let's see if we can spot anything just by mapping out what the first few terms of VAR1 should look like, substituting previous terms into newer terms:

i      VAR1[i]
___________________
0        1
1        1 + VAR2[2K-1]
2        1 + VAR2[2K-1] + VAR2[2K]
3        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]
4        1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]
...

Hopefully what jumps out at you is that the VAR2[] terms above follow a pattern of a prefix sum.

This means one possible solution method could be given by:

VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)])   (for i > 0) notes:(1) (2)
VAR1[i] = 1                                (for i = 0)

Now, a prefix sum can be done in parallel (this is not truly a fully independent operation, but it can be parallelized. I don't want to argue too much about terminology or purity here. I'm offering one possible method of parallelization for your stated problem, not the only way to do it.) To do a prefix sum in parallel on the GPU, I would use a library like CUB or Thrust. Or you can write your own although I wouldn't recommend it.

Notes:

  1. the use of -1 or -2 as an offset to i for the prefix sum may be dictated by your use of an inclusive or exclusive scan or prefix sum operation.

  2. VAR2 must be defined over an appropriate domain to make this sensible. However that requirement is implicit in your problem statement.

Here is a trivial worked example. In this case, since the VAR2 indexing term 2K+(I-1) just represents a fixed offset to I (2K-1), we are simply using an offset of 0 for demonstration purposes, so VAR2 is just a simple array over the same domain as VAR1. And I am defining VAR2 to just be an array of all 1, for demonstration purposes. The gpu parallel computation occurs in the VAR1 vector, the CPU equivalent computation is just computed on-the-fly in the cpu variable for validation purposes:

$ cat t1056.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>

const int dsize = 1000;
using namespace thrust::placeholders;
int main(){

  thrust::device_vector<int> VAR2(dsize, 1);  // initialize VAR2 array to all 1's
  thrust::device_vector<int> VAR1(dsize);
  thrust::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0); // put prefix sum of VAR2 into VAR1
  thrust::transform(VAR1.begin(), VAR1.end(), VAR1.begin(),  _1 += 1);   // add 1 to every term
  int cpu = 1;
  for (int i = 1; i < dsize; i++){
    int gpu = VAR1[i];
    cpu += VAR2[i];
    if (cpu != gpu) {std::cout << "mismatch at: " << i << " was: " << gpu << " should be: " << cpu << std::endl; return 1;}
    }
  std::cout << "Success!" << std::endl;
  return 0;
}

$ nvcc -o t1056 t1056.cu
$ ./t1056
Success!
$

For an additional reference particular to the usage of scan operations to solve linear recurrence problems, refer to Blelloch's paper here section 1.4. This question/answer gives an example of how to implement the equation 1.5 in that paper for a more general first-order recurrence case. This question considers the second-order recurrence case.

这篇关于GPU实现中的顺序操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆