CONFLICT_FREE_OFFSET macro used in the parallel prefix algorithm from GPU Gems 3

Question

First of all, here is the link to the algorithm: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

In order to avoid bank conflicts, padding is added to the shared memory array every NUM_BANKS elements (i.e., 32 for devices of compute capability 2.x). This is done as follows (as in Figure 39-5):

int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += ai/NUM_BANKS;
bi += bi/NUM_BANKS;
temp[bi] += temp[ai];
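
To see what the padding does, the index arithmetic can be replayed on the host. The following is a minimal sketch, not code from the chapter: the helper names `ai_index`, `padded`, `padded_shift`, and `bank` are hypothetical, and it assumes 32 banks of 32-bit words. For offset = 32, the unpadded ai values of threads 0, 1, and 2 are 31, 95, and 159, which all fall in bank 31; after padding they become 31, 97, and 163, which fall in banks 31, 1, and 3.

```cpp
#include <cassert>

constexpr int NUM_BANKS = 32;     // assumed: compute capability 2.x, as in the question
constexpr int LOG_NUM_BANKS = 5;  // log2(NUM_BANKS)

// Unpadded index of the first operand, copied from the snippet above.
constexpr int ai_index(int offset, int thid) { return offset * (2 * thid + 1) - 1; }

// Padded index: one extra slot is skipped for every NUM_BANKS elements.
constexpr int padded(int i) { return i + i / NUM_BANKS; }

// The same padding written with a shift (valid for non-negative i).
constexpr int padded_shift(int i) { return i + (i >> LOG_NUM_BANKS); }

// Bank that the 32-bit word at index i maps to.
constexpr int bank(int i) { return i % NUM_BANKS; }
```

Because the indices are non-negative and NUM_BANKS is a power of two, `i / NUM_BANKS` and `i >> LOG_NUM_BANKS` produce the same padding.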

I don't understand how ai/NUM_BANKS is equivalent to the macro:

   #define NUM_BANKS 16
   #define LOG_NUM_BANKS 4
   #define CONFLICT_FREE_OFFSET(n) \
          ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))

Isn't it equivalent to:

n >> LOG_NUM_BANKS

Any help is appreciated. Thanks.
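
One thing that can be checked independently of the article: in C and C++, `+` binds tighter than `>>`, and shifts associate left to right, so the macro as quoted groups as `((n) >> (NUM_BANKS + (n))) >> (2 * LOG_NUM_BANKS)`. The sketch below compares that reading with `n >> LOG_NUM_BANKS` and `n / NUM_BANKS`. The function names are hypothetical, and 64-bit integers with 0 <= n <= 47 are an assumption made only so that every shift count stays below the type width (a larger count is undefined behavior).

```cpp
#include <cassert>
#include <cstdint>

#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
// Macro exactly as quoted in the question.
#define CONFLICT_FREE_OFFSET(n) \
    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))

// All helpers assume 0 <= n <= 47 so that NUM_BANKS + n < 64.

// The macro applied as written.
constexpr std::int64_t as_written(std::int64_t n) {
    return CONFLICT_FREE_OFFSET(n);
}

// The same expression with the implicit grouping spelled out.
constexpr std::int64_t explicit_parse(std::int64_t n) {
    return (n >> (NUM_BANKS + n)) >> (2 * LOG_NUM_BANKS);
}

// The form the question expected.
constexpr std::int64_t by_shift(std::int64_t n) { return n >> LOG_NUM_BANKS; }

// The form used in the figure.
constexpr std::int64_t by_division(std::int64_t n) { return n / NUM_BANKS; }
```

For n = 40, `by_shift` and `by_division` both give 2, while `as_written` gives 0, so under this reading the quoted macro is not the same as `ai / NUM_BANKS`.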

Recommended answer

I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA. If you use a modern implementation of scan in CUDA you don't need any bank conflict avoidance.

If you want to do scans the easy way, use thrust::inclusive_scan or thrust::exclusive_scan.

If you really want to implement a scan, refer to more recent articles such as this one [1]. Or for a real opus with faster code but that will require a bit more study, this one [2]. Or read Sean Baxter's tutorial (though the latter doesn't include citations of the seminal work on the scan algorithm).

[1] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1041

[2] Merrill, D. and Grimshaw, A. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. Dec. 2009.
