CONFLICT_FREE_OFFSET macro used in the parallel prefix algorithm from GPU Gems 3
Question
First of all, here is the link to the algorithm: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
To avoid bank conflicts, one element of padding is added to the shared memory array every NUM_BANKS elements (i.e., 32 for devices of compute capability 2.x). This is done as follows (as in Figure 39-5):
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += ai/NUM_BANKS;
bi += bi/NUM_BANKS;
temp[bi] += temp[ai];
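To see what the padding buys, here is a minimal host-side sketch (plain C++, no GPU needed) that counts how many distinct banks a group of threads touches when reading temp[ai]. It assumes 32 banks of 32-bit words, as stated above; the helper names padded, bank, and distinct_banks are hypothetical and do not come from the article.

```cpp
#include <cassert>
#include <set>

constexpr int NUM_BANKS = 32;  // assumption: 32 banks (compute capability 2.x)

// Padded index: one extra slot for every NUM_BANKS elements.
constexpr int padded(int i) { return i + i / NUM_BANKS; }

// Bank holding the 32-bit word at shared-memory index i.
constexpr int bank(int i) { return i % NUM_BANKS; }

// Number of distinct banks touched when `threads` threads each read
// index ai = offset*(2*thid+1)-1, with or without padding.
int distinct_banks(int offset, int threads, bool pad) {
    assert(offset > 0 && threads > 0);
    std::set<int> banks;
    for (int thid = 0; thid < threads; ++thid) {
        int ai = offset * (2 * thid + 1) - 1;
        banks.insert(bank(pad ? padded(ai) : ai));
    }
    return static_cast<int>(banks.size());
}
```

With offset = 16, ai advances by 32 per thread, so distinct_banks(16, 32, false) is 1 (every thread hits the same bank), while distinct_banks(16, 32, true) is 32 (each thread gets its own bank).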
I don't understand how ai/NUM_BANKS is equivalent to the macro:
#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
#define CONFLICT_FREE_OFFSET(n) \
((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
Isn't it equal to
n >> LOG_NUM_BANKS
Any help is appreciated. Thanks.
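As a side note on the arithmetic in the question: for a non-negative index and a power-of-two NUM_BANKS, integer division by NUM_BANKS and a right shift by LOG_NUM_BANKS give the same result. The sketch below checks this on the host with the values from the macro definitions; the function names are hypothetical. Separately, in C and C++ the `+` operator binds more tightly than `>>`, so the macro exactly as written groups as ((n) >> (NUM_BANKS + (n))) >> (2 * LOG_NUM_BANKS) rather than as a sum of two shifts; what the authors intended is not stated here.

```cpp
#include <cassert>

constexpr int NUM_BANKS = 16;     // value from the macro definitions in the question
constexpr int LOG_NUM_BANKS = 4;  // log2(NUM_BANKS)

// Padding offset computed by integer division, as in Figure 39-5.
constexpr int offset_by_division(int n) { return n / NUM_BANKS; }

// Padding offset computed by a right shift, as the question proposes.
constexpr int offset_by_shift(int n) { return n >> LOG_NUM_BANKS; }

// Compile-time spot check: 100 / 16 and 100 >> 4 are both 6.
static_assert(offset_by_division(100) == offset_by_shift(100), "forms differ");

// Returns true if both forms agree for every index in [0, limit).
bool forms_agree(int limit) {
    assert(limit >= 0);
    for (int n = 0; n < limit; ++n)
        if (offset_by_division(n) != offset_by_shift(n)) return false;
    return true;
}
```

The equivalence only holds for non-negative n: for negative values, division truncates toward zero while an arithmetic right shift rounds toward negative infinity.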
Recommended answer
I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA. If you use a modern implementation of scan in CUDA you don't need any bank conflict avoidance.
If you want to do scans the easy way, use thrust::inclusive_scan or thrust::exclusive_scan.
If you really want to implement a scan, refer to more recent articles such as this one [1]. Or for a real opus with faster code but that will require a bit more study, this one [2]. Or read Sean Baxter's tutorial (though the latter doesn't include citations of the seminal work on the scan algorithm).
[1] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1041
[2] Merrill, D. and Grimshaw, A. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. Dec. 2009.