Reduce in CUDA for an arbitrary number of elements


Question

How can I implement version 7 of the code given in the following link:
http://www.cuvilib.com/Reduction.pdf

for an input array whose size is an arbitrary number, in other words, not a power of 2?

Solution

Version 7 already handles an arbitrary number of elements.

Perhaps instead of relying on the cuvilib link, you should look at the relevant NVIDIA CUDA reduction sample. It includes essentially the same PDF you are using, along with sample code that implements reductions 1 through 7 (labelled reduce0 through reduce6).

If you study the description of reduction 7 in the document, you'll see that the initial reduction steps are handled by a while loop that makes the grid stride through memory. As it does so, each thread accumulates multiple input elements.

This initial while loop is not limited to a particular size of problem (e.g. power of 2).

Because this while loop handles the initial part of the reduction, each threadblock is left with a fixed, power-of-2 number of partial results (one per thread). The later steps can therefore run as the highly efficient power-of-2 reduction at the threadblock level discussed earlier in that document. The input size itself is not limited to a power of 2.
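As a rough illustration of the two phases described above, here is a minimal sketch. It is not the code from the CUDA sample: the names `reduceSketch` and `sumOnDevice`, the `int` element type, and the launch configuration (64 blocks of 256 threads) are all assumptions made for this example. It is also deliberately simpler than reduce6, loading one element per loop iteration and omitting the sample's optimizations and all error checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, not the sample's reduce6.
// BLOCK (threads per block) must be a power of 2; n may be any size.
template <unsigned int BLOCK>
__global__ void reduceSketch(const int *in, int *out, unsigned int n)
{
    __shared__ int sdata[BLOCK];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * BLOCK + tid;
    unsigned int gridSize = BLOCK * gridDim.x;

    // Phase 1: the whole grid strides through memory. Each thread
    // accumulates as many elements as needed, so n is unrestricted.
    int sum = 0;
    while (i < n) {
        sum += in[i];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Phase 2: power-of-2 tree reduction over the BLOCK partial sums
    // held in shared memory.
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial result.
    if (tid == 0) {
        out[blockIdx.x] = sdata[0];
    }
}

// Hypothetical host helper: launches the kernel once and adds up the
// per-block results on the CPU. Error checking is omitted for brevity.
int sumOnDevice(const std::vector<int> &h_in)
{
    constexpr unsigned int kThreads = 256; // power of 2
    constexpr unsigned int kBlocks = 64;   // arbitrary fixed grid size

    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) {
        return 0;
    }

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, kBlocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduceSketch<kThreads><<<kBlocks, kThreads>>>(d_in, d_out, n);

    // This blocking copy waits for the kernel to finish.
    std::vector<int> partial(kBlocks);
    cudaMemcpy(partial.data(), d_out, kBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int total = 0;
    for (int p : partial) {
        total += p;
    }
    return total;
}
```

For instance, calling `sumOnDevice(std::vector<int>(1000, 1))` reduces 1000 elements, a length that is not a power of 2. Threads whose starting index is already past the end simply contribute 0 to the block-level step.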

Please study the code given in the CUDA sample (reduce6).
