CUDA: reduction or atomic operations?

Question

I'm writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is:

Have every thread store a value in shared memory, then run a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on compute capability 2.0 devices).

I couldn't use atomic operations because there is both a read and a write operation, so the threads could not be synchronized with __syncthreads().

Do any other ideas come to mind?

Recommended answer

This is the usual way to perform reductions in CUDA.

Within each block:

1) Keep a running reduced value in shared memory for each thread. Each thread reads n values from global memory (I personally favor n between 16 and 32) and updates its reduced value from them.

2) Perform the reduction algorithm within the block to get one final reduced value per block.

This way you will not need more shared memory than (number of threads) * sizeof(datatype) bytes.

Since each block produces one reduced value, you will need to perform a second reduction pass to get the final value.
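The two steps and the follow-up pass can be sketched as below. This is a minimal illustration rather than a tuned implementation: it assumes float data, n > 0, 256 threads per block and 16 values per thread. The names blockMax, deviceMax, THREADS and VALUES_PER_THREAD are made up for this example, and CUDA error checking is left out for brevity.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

constexpr int THREADS = 256;           // threads per block (assumed)
constexpr int VALUES_PER_THREAD = 16;  // values read per thread (assumed)

// Each block reduces up to THREADS * VALUES_PER_THREAD inputs to one maximum.
__global__ void blockMax(const float* in, float* out, int n)
{
    // One slot per thread: THREADS * sizeof(float) bytes of shared memory.
    __shared__ float smax[THREADS];

    int tid  = threadIdx.x;
    int base = blockIdx.x * THREADS * VALUES_PER_THREAD;

    // Step 1: running maximum per thread. At each step, consecutive
    // threads read consecutive addresses.
    float m = -FLT_MAX;
    for (int k = 0; k < VALUES_PER_THREAD; ++k) {
        int i = base + k * THREADS + tid;
        if (i < n) m = fmaxf(m, in[i]);
    }
    smax[tid] = m;
    __syncthreads();

    // Step 2: tree reduction in shared memory, halving the active
    // threads each iteration.
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (tid < s) smax[tid] = fmaxf(smax[tid], smax[tid + s]);
        __syncthreads();
    }

    // One result per block.
    if (tid == 0) out[blockIdx.x] = smax[0];
}

// Hypothetical host helper: repeats the pass until one value remains.
// d_in is a device pointer to n floats and is not modified.
float deviceMax(const float* d_in, int n)
{
    const int perBlock = THREADS * VALUES_PER_THREAD;
    int blocks = (n + perBlock - 1) / perBlock;

    // Two scratch buffers, swapped between passes.
    float *bufA = nullptr, *bufB = nullptr;
    cudaMalloc(&bufA, blocks * sizeof(float));
    cudaMalloc(&bufB, blocks * sizeof(float));

    const float* src = d_in;
    float* dst = bufA;
    while (true) {
        blockMax<<<blocks, THREADS>>>(src, dst, n);
        if (blocks == 1) break;          // dst[0] now holds the maximum
        n = blocks;                      // next pass reduces the block results
        blocks = (n + perBlock - 1) / perBlock;
        src = dst;
        dst = (dst == bufA) ? bufB : bufA;
    }

    float result;
    cudaMemcpy(&result, dst, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(bufA);
    cudaFree(bufB);
    return result;
}
```

A caller would copy the matrix to the device as a flat array and call deviceMax(d_in, rows * cols). The shared array holds one float per thread, which is 1 KB for 256 threads, well under the 48 KB limit mentioned in the question.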

For example, if you launch 256 threads per block and read 16 values per thread, you will be able to reduce 256 * 16 = 4096 elements per block.

So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.

You will need a third pass when the number of elements exceeds 4096^2 (about 16.8 million) for this configuration.
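The arithmetic behind these figures, worked out for the configuration of 256 threads per block and 16 values per thread (the symbols E, B_1, B_2 and N_max are introduced here only for illustration):

```latex
\begin{align*}
E &= 256 \times 16 = 4096 && \text{elements reduced per block} \\
B_1 &= \left\lceil \frac{10^6}{4096} \right\rceil = 245 && \text{blocks in the first pass} \\
B_2 &= \left\lceil \frac{245}{4096} \right\rceil = 1 && \text{block in the second pass} \\
N_{\max} &= 4096^2 = 16\,777\,216 && \text{largest input that two passes can handle}
\end{align*}
```

Each pass shrinks the input by a factor of 4096, so an input larger than 4096^2 elements still leaves more than 4096 values after the first pass, which is more than a single block can finish in the second.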

You will have to take care that the global memory reads are coalesced. The global memory writes cannot be coalesced, since each block writes only a single value, but that is a performance hit you have to accept.
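To make the coalescing point concrete, here is a rough sketch contrasting two ways a thread can pick its values. The kernel names and the perThread parameter are hypothetical, and the per-thread result is written to global memory only so that the two variants can be compared; in a real reduction it would go to shared memory instead.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Coalesced: at step k, thread t reads element base + k * blockDim.x + t,
// so the threads of a warp touch one contiguous segment of memory.
__global__ void threadMaxCoalesced(const float* in, float* out,
                                   int n, int perThread)
{
    int base = blockIdx.x * blockDim.x * perThread;
    float m = -FLT_MAX;
    for (int k = 0; k < perThread; ++k) {
        int i = base + k * blockDim.x + threadIdx.x;
        if (i < n) m = fmaxf(m, in[i]);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = m;
}

// Strided (to be avoided): each thread walks its own contiguous chunk,
// so at every step neighbouring threads are perThread elements apart.
__global__ void threadMaxStrided(const float* in, float* out,
                                 int n, int perThread)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * perThread;
    float m = -FLT_MAX;
    for (int k = 0; k < perThread; ++k) {
        int i = base + k;
        if (i < n) m = fmaxf(m, in[i]);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = m;
}
```

Both kernels cover the same range of elements per block and differ only in which thread reads which element, so the maximum over each block's outputs is the same either way. Only the first pattern lets the hardware combine a warp's loads into a few wide memory transactions.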
