Cumulative sum in two dimensions on array in nested loop -- CUDA implementation?


Problem description


I have been thinking of how to perform this operation on CUDA using reductions, but I'm a bit at a loss as to how to accomplish it. The C code is below. The important part to keep in mind -- the variable precalculatedValue depends on both loop iterators. Also, the variable ngo is not unique to every value of m... e.g. m = 0,1,2 might have ngo = 1, whereas m = 4,5,6,7,8 could have ngo = 2, etc. I have included sizes of loop iterators in case it helps to provide better implementation suggestions.

// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

int Nobs = 60480;
int NgS  = 1859;
int NgO  = 900;
// ngo goes from [1,900]

// rInd is initialized (and filled earlier) as:
// rInd = new long int [Nobs];

for (m=0; m<Nobs; m++) {        
    ngo=rInd[m]-1;

    for (n=0; n<NgS; n++) {
        Aggregation[idx(n,ngo,NgO)] += precalculatedValue;
    }
}

In a previous case, when precalculatedValue was only a function of the inner loop variable, I saved the values in unique array indices and added them with a parallel reduction (Thrust) after the fact. However, this case has me stumped: the values of m are not uniquely mapped to the values of ngo. Thus, I don't see a way of making this code efficient (or even workable) with a reduction. Any ideas are most welcome.

Solution

Just a stab...

I suspect that transposing your loops might help.

for (n=0; n<NgS; n++) {
    for (m=0; m<Nobs; m++) {            
        ngo=rInd[m]-1;
        Aggregation[idx(n,ngo,NgO)] += precalculatedValue(m,n);
    }
}

The reason I did this is because idx varies more rapidly with ngo (function of m) than with n, so making m the inner loop improves coherence. Note I also made precalculatedValue a function of (m, n) because you said that it is -- this makes the pseudocode clearer.

Then, you could start by leaving the outer loop on the host, and making a kernel for the inner loop (60,480-way parallelism is enough to fill most current GPUs).

In the inner loop, just start by using an atomicAdd() to handle collisions. If they are infrequent, performance on Fermi GPUs shouldn't be too bad. In any case, you will be bandwidth bound since the arithmetic intensity of this computation is low. So once this is working, measure the bandwidth you are achieving and compare it to the peak for your GPU. If you are way off, then think about optimizing further (perhaps by parallelizing the outer loop -- one iteration per thread block -- and doing the inner loop using shared memory and thread-cooperation optimizations).
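
For concreteness, here is a minimal sketch of that first version (outer loop on the host, one kernel launch per value of n, atomicAdd() in the kernel). It assumes float data and that precalculatedValue has been precomputed on the device as a flattened Nobs x NgS array; d_rInd, d_precalc, and d_Aggregation are placeholder names for device copies of the arrays, not names from the original code.

// Sketch only -- float data assumed; d_rInd, d_precalc, and d_Aggregation are
// device copies of rInd, precalculatedValue (flattened to [Nobs][NgS]), and
// Aggregation (flattened to [NgS][NgO]).
// Same flattening macro as in the question:
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

__global__ void aggregateRow(int n, int Nobs, int NgS, int NgO,
                             const long int *d_rInd,
                             const float *d_precalc,
                             float *d_Aggregation)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < Nobs) {
        int ngo = (int)d_rInd[m] - 1;
        // several m can map to the same ngo, so resolve collisions atomically
        atomicAdd(&d_Aggregation[idx(n, ngo, NgO)], d_precalc[idx(m, n, NgS)]);
    }
}

// Host side: keep the outer loop over n, one kernel launch per iteration.
void aggregate(int Nobs, int NgS, int NgO,
               const long int *d_rInd, const float *d_precalc,
               float *d_Aggregation)
{
    int threads = 256;
    int blocks  = (Nobs + threads - 1) / threads;   // covers all 60,480 observations
    for (int n = 0; n < NgS; n++) {
        aggregateRow<<<blocks, threads>>>(n, Nobs, NgS, NgO,
                                          d_rInd, d_precalc, d_Aggregation);
    }
    cudaDeviceSynchronize();
}

Note that atomicAdd() on float in global memory requires compute capability 2.0 (Fermi) or later; if Aggregation is double you would need an atomicCAS()-based workaround on Fermi, or a GPU with compute capability 6.0+.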

The key: start simple, measure performance, and then decide how to optimize.

Note that this calculation looks very similar to a histogram calculation, which has similar challenges, so you might want to google for GPU histograms to see how they have been implemented.

One idea is to sort (rInd[m], m) pairs using thrust::sort_by_key() and then (since the rInd duplicates will be grouped together), you can iterate over them and do your reductions without collisions. (This is one way to do histograms.) You could even do this with thrust::reduce_by_key().
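
A rough sketch of that idea for a single value of n is below, sorting the precalculated values directly along with the keys rather than the (rInd[m], m) pairs. The vector names (d_keys, d_vals, d_outKeys, d_outSums) are hypothetical, and the same float assumption as above applies.

// Sketch only -- handles one value of n at a time; d_keys holds rInd[m]-1 and
// d_vals holds precalculatedValue(m,n) for m = 0..Nobs-1.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Returns the number of distinct ngo values written to d_outKeys/d_outSums
// (both must be pre-sized to hold up to Nobs entries).
size_t aggregateRowThrust(thrust::device_vector<int>   &d_keys,
                          thrust::device_vector<float> &d_vals,
                          thrust::device_vector<int>   &d_outKeys,
                          thrust::device_vector<float> &d_outSums)
{
    // group duplicate ngo values together; d_vals is permuted along with d_keys
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_vals.begin());

    // sum all contributions to the same ngo -- no collisions to worry about
    return thrust::reduce_by_key(d_keys.begin(), d_keys.end(), d_vals.begin(),
                                 d_outKeys.begin(), d_outSums.begin()).first
           - d_outKeys.begin();
}

The resulting (ngo, sum) pairs can then be scattered into row n of Aggregation, e.g. with thrust::scatter() or a trivial kernel. Also, since rInd does not depend on n, you could sort the observation indices m by rInd[m] once up front and reuse that permutation to gather the values for each n, which is closer to the (rInd[m], m) pairing described above.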
