Trying to understand prefix sum execution


Question

I am trying to understand the scan-then-fan scan implementation mentioned in the book The CUDA Handbook.

  1. Can someone explain the device function scanWarp? Why the negative indexes? Could you please give a numerical example?
  2. I have the same question about the line warpPartials[16+warpid] = sum;. How does the assignment happen?
  3. What is the contribution of the line if ( warpid==0 ) { scanWarp<T,bZeroPadded>( 16+warpPartials+tid ); }?
  4. Could someone please explain sum += warpPartials[16+warpid-1];? A numerical example would be highly appreciated.
  5. Finally, a more C++-oriented question: how do we know the indexes that are used in *sPartials = sum; to store values in sPartials?

PS: A numerical example that demonstrates the whole execution would be very helpful.

template < class T, bool bZeroPadded > 
inline __device__ T
scanBlock( volatile T *sPartials ){


   extern __shared__ T warpPartials[];
   const int tid = threadIdx.x;
   const int lane = tid & 31;
   const int warpid = tid >> 5;

   //
   // Compute this thread's partial sum
   //
   T sum = scanWarp<T,bZeroPadded>( sPartials );
   __syncthreads();

   //
   // Write each warp's reduction to shared memory
   // 
   if ( lane == 31 ) {
       warpPartials[16+warpid] = sum;
   }
   __syncthreads();

   //
   // Have one warp scan reductions
   //
   if ( warpid==0 ) {
       scanWarp<T,bZeroPadded>( 16+warpPartials+tid );
   }
   __syncthreads();

   //
   // Fan out the exclusive scan element (obtained
   // by the conditional and the decrement by 1)
   // to this warp's pending output
   //
   if ( warpid > 0 ) {
       sum += warpPartials[16+warpid-1];
   }
   __syncthreads();

   //
   // Write this thread's scan output
   //
   *sPartials = sum;
   __syncthreads();

   //
   // The return value will only be used by caller if it
   // contains the spine value (i.e. the reduction
   // of the array we just scanned).
   //
   return sum;
}


template < class T, bool bZeroPadded >
inline __device__ T 
scanWarp( volatile T *sPartials ){

   const int tid = threadIdx.x;
   const int lane = tid & 31;

   if ( lane >=  1 ) sPartials[0] += sPartials[- 1];
   if ( lane >=  2 ) sPartials[0] += sPartials[- 2];
   if ( lane >=  4 ) sPartials[0] += sPartials[- 4];
   if ( lane >=  8 ) sPartials[0] += sPartials[- 8];
   if ( lane >= 16 ) sPartials[0] += sPartials[-16];

   return sPartials[0];
}

Solution

The scan-then-fan strategy is applied at two levels. For the grid-level scan (which operates on global memory), partials are written to the temporary global memory buffer allocated in the host code, scanned by recursively calling the host function, then added to the eventual output with a separate kernel invocation. For the block-level scan (which operates on shared memory), partials are written to the base of shared memory (warpPartials[]), scanned by one warp, then added to the eventual output of the block-level scan. The code that you are asking about is doing the block-level scan.
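
Before the point-by-point answers, here is a runnable host-side simulation of the block-level phases (a sketch assuming one 128-thread block, four warps, and an input of all 1s; it reproduces the kernel's arithmetic sequentially):

   #include <cstdio>

   int main(){
      const int WARP = 32, NWARPS = 4, N = WARP * NWARPS;
      int sPartials[N];
      int warpPartials[16 + NWARPS] = {0};   // 16-element pad + warp sums

      for ( int i = 0; i < N; i++ ) sPartials[i] = 1;

      // Phase 1: each warp scans its own 32 elements (scanWarp).
      for ( int w = 0; w < NWARPS; w++ )
         for ( int lane = 1; lane < WARP; lane++ )
            sPartials[w*WARP + lane] += sPartials[w*WARP + lane - 1];
      // Every warp now holds 1 2 ... 32.

      // Phase 2: lane 31 of each warp publishes its total (if lane == 31).
      for ( int w = 0; w < NWARPS; w++ )
         warpPartials[16 + w] = sPartials[w*WARP + 31];
      // warpPartials[16..19] = 32 32 32 32

      // Phase 3: one warp scans the totals (if warpid == 0).
      for ( int w = 1; w < NWARPS; w++ )
         warpPartials[16 + w] += warpPartials[16 + w - 1];
      // warpPartials[16..19] = 32 64 96 128

      // Phase 4: fan the preceding warp's total out (if warpid > 0).
      for ( int w = 1; w < NWARPS; w++ )
         for ( int lane = 0; lane < WARP; lane++ )
            sPartials[w*WARP + lane] += warpPartials[16 + w - 1];

      for ( int i = 0; i < N; i++ ) printf( "%d ", sPartials[i] );
      printf( "\n" );   // prints 1 2 3 ... 128
      return 0;
   }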

1) The implementation of scanWarp that you are referencing is called with a shared memory pointer that has already had threadIdx.x added to it, so each thread's version of sPartials points to a different shared memory element. Using a fixed index on sPartials causes adjacent threads to operate on adjacent shared memory elements. Negative indices are okay as long as they do not result in out-of-bounds array indexing. This implementation borrows from the optimized version that pads shared memory with zeros (Listing 13.14): there, every thread can unconditionally use a fixed negative index, and threads below a certain index just read zeros. It could just as easily have predicated execution on the lowest threads in the warp and used positive indices.
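
For a numerical feel, trace the posted (predicated) scanWarp with every input equal to 1: after the += sPartials[-1] step the lanes hold 1 2 2 2 ..., after += sPartials[-2] they hold 1 2 3 4 4 4 ..., after += sPartials[-4] they hold 1 2 3 4 5 6 7 8 8 8 ..., and after all five steps lane l holds l+1, the inclusive scan of 32 ones. The zero-padded variant alluded to above might look like the sketch below (my reconstruction, not the book's exact Listing 13.14; it assumes the caller has zeroed the 16 shared-memory elements just below each warp's 32-element window so the negative reads land on zeros):

   template < class T >
   inline __device__ T
   scanWarpZeroPadded( volatile T *sPartials ){   // hypothetical name

      // No (lane >= k) predicates: the low lanes read zeros from the
      // padding instead of being masked off, saving the comparisons.
      sPartials[0] += sPartials[- 1];
      sPartials[0] += sPartials[- 2];
      sPartials[0] += sPartials[- 4];
      sPartials[0] += sPartials[- 8];
      sPartials[0] += sPartials[-16];

      return sPartials[0];
   }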

2) The last thread (lane 31) of each 32-thread warp contains that warp's partial sum after the warp scan, which has to be stored somewhere in order to be scanned and then added to the output. warpPartials[] aliases shared memory from the first element, so it can be used to hold each warp's partial sum. You could use any part of shared memory for this calculation, because each thread already has its own scan value in a register (the assignment T sum = scanWarp...).
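
As a numeric example (assuming a 128-thread block, i.e. four warps, with all inputs equal to 1): after scanWarp, the lane-31 thread of each warp holds its warp's total, 32. Those four threads are tid 31, 63, 95 and 127, with warpid 0 through 3, so the line performs:

   warpPartials[16 + 0] = 32;   // tid  31 (warp 0)
   warpPartials[16 + 1] = 32;   // tid  63 (warp 1)
   warpPartials[16 + 2] = 32;   // tid  95 (warp 2)
   warpPartials[16 + 3] = 32;   // tid 127 (warp 3)

The offset of 16 leaves padding below the warp totals so that the subsequent warp scan can index negatively without going out of bounds.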

3) Some warp (it could be any warp, so it might as well be warp 0) has to scan the partials that were written to warpPartials[]. At most one warp is needed: the hardware limit of 1024 threads per block means at most 1024/32 = 32 warps, so there are never more warp partials than a single 32-thread warp can scan. The code is exploiting the coincidence that the maximum thread count per block divided by the warp size is no larger than the warp size.
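
The argument 16+warpPartials+tid is plain pointer arithmetic, equivalent to &warpPartials[16 + tid]. This small host-side snippet (hypothetical, purely to visualize the indexing) shows which element each of warp 0's threads scans, continuing the four-warp example:

   #include <cstdio>

   int main(){
      int warpPartials[16 + 32];               // padding + up to 32 warp sums
      for ( int tid = 0; tid < 4; tid++ ) {    // the four warp totals
         int *p = 16 + warpPartials + tid;     // same expression as the kernel
         printf( "tid %d scans warpPartials[%d]\n", tid, (int)(p - warpPartials) );
      }
      return 0;
   }

It prints indices 16 through 19; scanning the totals {32, 32, 32, 32} in place turns them into {32, 64, 96, 128}.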

4) This code is adding the scanned per-warp partials to each output element. The first warp already has the correct values, so the addition is done only by the second and subsequent warps. Another way to look at this is that it's adding the exclusive scan of the warp partials to the output.
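
Continuing the example: after warp 0's scan, warpPartials[16..19] holds 32, 64, 96, 128. Each thread of warp w > 0 adds warpPartials[16 + w - 1], the inclusive total of all preceding warps, which is exactly its warp's exclusive-scan offset:

   // warpid 1: sum += warpPartials[16];   // += 32, outputs become 33..64
   // warpid 2: sum += warpPartials[17];   // += 64, outputs become 65..96
   // warpid 3: sum += warpPartials[18];   // += 96, outputs become 97..128
   // warpid 0 adds nothing; its values 1..32 are already final.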

5) scanBlock is a device function; the address arithmetic is done by its caller, scanAndWritePartials, which hands each thread a pointer to its own shared memory element: volatile T *myShared = sPartials + tid;. Dereferencing that pointer in *sPartials = sum therefore stores each thread's scan result in its own slot.
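
A minimal sketch of such a caller (a hypothetical kernel, not the book's actual scanAndWritePartials listing, shown for int to keep it short):

   // Launch with blockDim.x * sizeof(int) bytes of dynamic shared memory.
   __global__ void
   scanBlocksSketch( int *out, const int *in, size_t N ){

      extern __shared__ int sPartials[];
      const int tid = threadIdx.x;
      const size_t i = blockIdx.x * (size_t) blockDim.x + tid;

      sPartials[tid] = ( i < N ) ? in[i] : 0;
      __syncthreads();

      // Each thread passes a pointer to ITS OWN element, so inside
      // scanBlock the statement *sPartials = sum writes this thread's
      // result back to sPartials[tid].
      volatile int *myShared = sPartials + tid;
      scanBlock<int,false>( myShared );

      if ( i < N ) out[i] = sPartials[tid];
   }

This only performs the block-level scan; the grid-level pass that scans and fans out the per-block totals is a separate step, as described above.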
