Implementing Max Reduce in Cuda


Problem Description

I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel

__global__ void max_reduce(const float* const d_array,
                     float* d_max,
                     const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    if (gid == 0)
        *d_max = shared[tid];
}

I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.

To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same from the kernel, but only the min reduce matches the serial results.
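
A minimal sketch of that serial check (the names are illustrative, not the actual test code):

#include <algorithm>
#include <cstddef>

// Serial reference reduction on the host, used only to verify the kernel
// results. Names are illustrative placeholders.
static void serial_min_max(const float* h_array, std::size_t elements,
                           float& out_min, float& out_max)
{
    out_min = h_array[0];
    out_max = h_array[0];
    for (std::size_t i = 1; i < elements; ++i) {
        out_min = std::min(out_min, h_array[i]);
        out_max = std::max(out_max, h_array[i]);
    }
}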

Is there something obvious I'm missing/doing wrong?

Solution

Your main conclusion in your deleted answer was correct: the kernel you posted doesn't account for the fact that at the end of the kernel execution you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:

__device__ float atomicMaxf(float* address, float val)
{
    // Emulate an atomic max for float with atomicCAS on the value's bit
    // pattern, since there is no native atomicMax(float*, float).
    int *address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    // Keep trying until either our value is no longer the larger one,
    // or the compare-and-swap succeeds in installing it.
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val));
    }
    return __int_as_float(old);
}


__global__ void max_reduce(const float* const d_array, float* d_max, 
                                              const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;
    shared[tid] = -FLT_MAX;  // 1

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);  // 2
        __syncthreads();
    }
    // what to do now?
    // option 1: save block result and launch another kernel
    if (tid == 0)        
        d_max[blockIdx.x] = shared[tid]; // 3
    // option 2: use atomics
    if (tid == 0)
      atomicMaxf(d_max, shared[0]);
}

  1. As Pavan indicated, you need to initialize your shared memory array (here to -FLT_MAX, which requires <float.h>). The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
  2. Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to gid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
  3. As you already discovered, your last line was not correct. Each block produces its own result, and we must combine them somehow. One method you might consider if the number of blocks launched is small (more on this later) is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate an arbitrary atomic function, and I have provided an example of that in atomicMaxf, which provides an atomic max for float. A host-side sketch of the two-kernel (option 1) approach is shown just after these notes.
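
For concreteness, a host-side sketch of the two-kernel (option 1) approach might look like the following. It assumes max_reduce keeps only the d_max[blockIdx.x] write (option 1), that threads is a power of two, and that the first pass launches no more blocks than threads (otherwise a further pass would be needed); error checking is omitted.

#include <cuda_runtime.h>

// Host-side sketch of option 1: two reduction passes.
// Assumes max_reduce keeps only the "save block result" write, that
// `threads` is a power of two, and that blocks <= threads so a single
// block can cover all the partial results in the second pass.
void two_pass_max(const float* d_array, float* d_max, size_t elements)
{
    const int threads = 256;
    const int blocks  = (int)((elements + threads - 1) / threads);

    float* d_intermediate = 0;   // one partial result per block
    cudaMalloc(&d_intermediate, blocks * sizeof(float));

    // Pass 1: each block reduces its slice of d_array into d_intermediate.
    max_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_intermediate, elements);

    // Pass 2: one block reduces the per-block partial results into d_max[0].
    max_reduce<<<1, threads, threads * sizeof(float)>>>(d_intermediate, d_max, blocks);

    cudaFree(d_intermediate);
}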

But is running 1024 or more atomic functions (one per block) the best way? Probably not.

When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:

__device__ float atomicMaxf(float* address, float val)
{
    int *address_as_int =(int*)address;
    int old = *address_as_int, assumed;
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val));
        }
    return __int_as_float(old);
}


__global__ void max_reduce(const float* const d_array, float* d_max, 
                                              const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;
    shared[tid] = -FLT_MAX;

    while (gid < elements) {
        shared[tid] = max(shared[tid], d_array[gid]);
        gid += gridDim.x*blockDim.x;
        }
    __syncthreads();
    gid = (blockDim.x * blockIdx.x) + tid;  // 1
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
      atomicMaxf(d_max, shared[0]);
}

With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128; you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to its initial value. This is actually unnecessary, and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.

Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLT_MAX.
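
A minimal host-side sketch of that setup for the grid-stride kernel above (seeding *d_max, choosing about 8 blocks per SM as discussed; the thread count and function name are illustrative, and error checking is omitted):

#include <cfloat>            // FLT_MAX
#include <cuda_runtime.h>

// Host-side sketch for the grid-stride/atomic version: seed *d_max with
// -FLT_MAX, pick a fixed block count (about 8 per SM), and launch once.
void atomic_max_launch(const float* d_array, float* d_max, size_t elements)
{
    const float init = -FLT_MAX;
    cudaMemcpy(d_max, &init, sizeof(float), cudaMemcpyHostToDevice);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int threads = 256;                           // power of two for the sweep
    const int blocks  = 8 * prop.multiProcessorCount;  // e.g. 128 on a 16-SM GPU

    max_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_max, elements);
}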

Again, we normally steer people away from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.

For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
