删除CUDA warp级还原中的__syncthreads() [英] Removing __syncthreads() in CUDA warp-level reduction

查看:244
本文介绍了删除CUDA warp级还原中的__syncthreads()的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下代码将数组中的每个 32 元素与每个 32 元素组的第一个元素相加:

The following code sums every 32 elements in an array to the very first element of each 32 element group:

int i = threadIdx.x;
int warpid = i&31;
if(warpid < 16){
    s_buf[i] += s_buf[i+16];__syncthreads();
    s_buf[i] += s_buf[i+8];__syncthreads();
    s_buf[i] += s_buf[i+4];__syncthreads();
    s_buf[i] += s_buf[i+2];__syncthreads();
    s_buf[i] += s_buf[i+1];__syncthreads();
}



我想我可以消除所有的 __ syncthreads ,因为所有的操作都是在同一个warp中完成的。但如果我消除它们,我得到垃圾的结果。它不会影响性能太多,但我想知道为什么我需要 __ syncthreads()这里。

I thought I can eliminate all the __syncthreads() in the code, since all the operations are done in the same warp. But if I eliminate them, I get garbage results back. It shall not affect performance too much, but I want to know why I need __syncthreads() here.

推荐答案

我在这里提供答案,因为我认为上述两个不完全令人满意。这个答案的知识产权属于Mark Harris,他在这个演示文稿(幻灯片22)和@talonmies,他已在上述注释中将此问题指向OP。

I'm providing an answer here because I think that the above two are not fully satisfactory. The "intellectual property" of this answer belongs to Mark Harris, who has pointed out this issue in this presentation (slide 22), and to @talonmies, who has pointed this problem out to the OP in the comments above.

首先尝试恢复OP要求的,过滤他的错误。

Let me first try to resume what the OP was asking, filtering his mistakes.

OP似乎是通过循环展开来处理减少共享内存减少的最后一个步骤, warp减少。他正在做类似

The OP seems to be dealing with the last step of reduction in shared memory reduction, warp reduction by loop unrolling. He is doing something like

template <class T>
__device__ void warpReduce(T *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

template <class T>
__global__ void reduce4_no_synchthreads(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid    = threadIdx.x;                              // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;       // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory. 
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
    if (tid < 32) warpReduce(sdata, tid);

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

正如Mark Harris和talonmies指出的,共享内存变量 sdata 必须声明为 volatile ,以防止编译器优化。因此,定义上面 __ device __ 函数的正确方法是:

As pointed out by Mark Harris and talonmies, the shared memory variable sdata must be declared as volatile, to prevent compiler optimizations. So, the right way to define the __device__ function above is:

template <class T>
__device__ void warpReduce(volatile T *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}



现在让我们看看与上述两种情况相对应的反汇编代码,即, sdata 声明为 volatile volatile 对于费米建筑)。

Let us now see the disassembled codes corresponding to the two cases above examined, i.e., sdata declared as not volatile or volatile (code compiled for Fermi architecture).

volatile

    /*0000*/         MOV R1, c[0x1][0x100];                          /* 0x2800440400005de4 */
    /*0008*/         S2R R0, SR_CTAID.X;                             /* 0x2c00000094001c04 */
    /*0010*/         SHL R3, R0, 0x1;                                /* 0x6000c0000400dc03 */
    /*0018*/         S2R R2, SR_TID.X;                               /* 0x2c00000084009c04 */
    /*0020*/         IMAD R3, R3, c[0x0][0x8], R2;                   /* 0x200440002030dca3 */
    /*0028*/         IADD R4, R3, c[0x0][0x8];                       /* 0x4800400020311c03 */
    /*0030*/         ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT;  /* 0x188e4000a031dc03 */
    /*0038*/         ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT;  /* 0x1b0e4000a043dc03 */
    /*0040*/     @P0 ISCADD R3, R3, c[0x0][0x20], 0x2;               /* 0x400040008030c043 */
    /*0048*/    @!P1 ISCADD R4, R4, c[0x0][0x20], 0x2;               /* 0x4000400080412443 */
    /*0050*/    @!P0 MOV R5, RZ;                                     /* 0x28000000fc0161e4 */
    /*0058*/    @!P1 LD R4, [R4];                                    /* 0x8000000000412485 */
    /*0060*/     @P0 LD R5, [R3];                                    /* 0x8000000000314085 */
    /*0068*/         SHL R3, R2, 0x2;                                /* 0x6000c0000820dc03 */
    /*0070*/         NOP;                                            /* 0x4000000000001de4 */
    /*0078*/    @!P1 IADD R5, R4, R5;                                /* 0x4800000014416403 */
    /*0080*/         MOV R4, c[0x0][0x8];                            /* 0x2800400020011de4 */
    /*0088*/         STS [R3], R5;                                   /* 0xc900000000315c85 */
    /*0090*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0098*/         MOV R6, c[0x0][0x8];                            /* 0x2800400020019de4 */
    /*00a0*/         ISETP.LT.U32.AND P0, PT, R6, 0x42, PT;          /* 0x188ec0010861dc03 */
    /*00a8*/     @P0 BRA 0x118;                                      /* 0x40000001a00001e7 */
    /*00b0*/         NOP;                                            /* 0x4000000000001de4 */
    /*00b8*/         NOP;                                            /* 0x4000000000001de4 */
    /*00c0*/         MOV R6, R4;                                     /* 0x2800000010019de4 */
    /*00c8*/         SHR.U32 R4, R4, 0x1;                            /* 0x5800c00004411c03 */
    /*00d0*/         ISETP.GE.U32.AND P0, PT, R2, R4, PT;            /* 0x1b0e00001021dc03 */
    /*00d8*/    @!P0 IADD R7, R4, R2;                                /* 0x480000000841e003 */
    /*00e0*/    @!P0 SHL R7, R7, 0x2;                                /* 0x6000c0000871e003 */
    /*00e8*/    @!P0 LDS R7, [R7];                                   /* 0xc10000000071e085 */
    /*00f0*/    @!P0 IADD R5, R7, R5;                                /* 0x4800000014716003 */
    /*00f8*/    @!P0 STS [R3], R5;                                   /* 0xc900000000316085 */
    /*0100*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0108*/         ISETP.GT.U32.AND P0, PT, R6, 0x83, PT;          /* 0x1a0ec0020c61dc03 */
    /*0110*/     @P0 BRA 0xc0;                                       /* 0x4003fffea00001e7 */
    /*0118*/         ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT;          /* 0x1a0ec0007c21dc03 */
    /*0120*/     @P0 BRA.U 0x198;                                    /* 0x40000001c00081e7 */
    /*0128*/    @!P0 LDS R8, [R3];                                   /* 0xc100000000322085 */
    /*0130*/    @!P0 LDS R5, [R3+0x80];                              /* 0xc100000200316085 */
    /*0138*/    @!P0 LDS R4, [R3+0x40];                              /* 0xc100000100312085 */
    /*0140*/    @!P0 LDS R7, [R3+0x20];                              /* 0xc10000008031e085 */
    /*0148*/    @!P0 LDS R6, [R3+0x10];                              /* 0xc10000004031a085 */
    /*0150*/    @!P0 IADD R8, R8, R5;                                /* 0x4800000014822003 */
    /*0158*/    @!P0 IADD R8, R8, R4;                                /* 0x4800000010822003 */
    /*0160*/    @!P0 LDS R5, [R3+0x8];                               /* 0xc100000020316085 */
    /*0168*/    @!P0 IADD R7, R8, R7;                                /* 0x480000001c81e003 */
    /*0170*/    @!P0 LDS R4, [R3+0x4];                               /* 0xc100000010312085 */
    /*0178*/    @!P0 IADD R6, R7, R6;                                /* 0x480000001871a003 */
    /*0180*/    @!P0 IADD R5, R6, R5;                                /* 0x4800000014616003 */
    /*0188*/    @!P0 IADD R4, R5, R4;                                /* 0x4800000010512003 */
    /*0190*/    @!P0 STS [R3], R4;                                   /* 0xc900000000312085 */
    /*0198*/         ISETP.NE.AND P0, PT, R2, RZ, PT;                /* 0x1a8e0000fc21dc23 */
    /*01a0*/     @P0 BRA.U 0x1c0;                                    /* 0x40000000600081e7 */
    /*01a8*/    @!P0 ISCADD R0, R0, c[0x0][0x24], 0x2;               /* 0x4000400090002043 */
    /*01b0*/    @!P0 LDS R2, [RZ];                                   /* 0xc100000003f0a085 */
    /*01b8*/    @!P0 ST [R0], R2;                                    /* 0x900000000000a085 */
    /*01c0*/         EXIT;                                           /* 0x8000000000001de7 */

/ * 0128 * / - / * 0148 * / / * 0160 * / / * 0170 * / 共享存储器加载到寄存器,线 / * 0190 * / 加载到共享存储器。中间线对应于在寄存器中执行的求和。因此,中间结果保存在寄存器中(每个线程都是私有的),而不是每次刷新到共享内存,从而防止线程对中间结果有完全的可见性。

Lines /*0128*/-/*0148*/, /*0160*/ and /*0170*/ correspond to the shared memory loads to registers and line /*0190*/ to the shared memory store from register. The intermediate lines correspond to the summations, as performed in registers. So, the intermediate results are kept in registers (which are private to each thread) and not flushed each time to shared memory, preventing the threads to have full visibility of the intermediate results.

volatile

    /*0000*/         MOV R1, c[0x1][0x100];                          /* 0x2800440400005de4 */
    /*0008*/         S2R R0, SR_CTAID.X;                             /* 0x2c00000094001c04 */
    /*0010*/         SHL R3, R0, 0x1;                                /* 0x6000c0000400dc03 */
    /*0018*/         S2R R2, SR_TID.X;                               /* 0x2c00000084009c04 */
    /*0020*/         IMAD R3, R3, c[0x0][0x8], R2;                   /* 0x200440002030dca3 */
    /*0028*/         IADD R4, R3, c[0x0][0x8];                       /* 0x4800400020311c03 */
    /*0030*/         ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT;  /* 0x188e4000a031dc03 */
    /*0038*/         ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT;  /* 0x1b0e4000a043dc03 */
    /*0040*/     @P0 ISCADD R3, R3, c[0x0][0x20], 0x2;               /* 0x400040008030c043 */
    /*0048*/    @!P1 ISCADD R4, R4, c[0x0][0x20], 0x2;               /* 0x4000400080412443 */
    /*0050*/    @!P0 MOV R5, RZ;                                     /* 0x28000000fc0161e4 */
    /*0058*/    @!P1 LD R4, [R4];                                    /* 0x8000000000412485 */
    /*0060*/     @P0 LD R5, [R3];                                    /* 0x8000000000314085 */
    /*0068*/         SHL R3, R2, 0x2;                                /* 0x6000c0000820dc03 */
    /*0070*/         NOP;                                            /* 0x4000000000001de4 */
    /*0078*/    @!P1 IADD R5, R4, R5;                                /* 0x4800000014416403 */
    /*0080*/         MOV R4, c[0x0][0x8];                            /* 0x2800400020011de4 */
    /*0088*/         STS [R3], R5;                                   /* 0xc900000000315c85 */
    /*0090*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0098*/         MOV R6, c[0x0][0x8];                            /* 0x2800400020019de4 */
    /*00a0*/         ISETP.LT.U32.AND P0, PT, R6, 0x42, PT;          /* 0x188ec0010861dc03 */
    /*00a8*/     @P0 BRA 0x118;                                      /* 0x40000001a00001e7 */
    /*00b0*/         NOP;                                            /* 0x4000000000001de4 */
    /*00b8*/         NOP;                                            /* 0x4000000000001de4 */
    /*00c0*/         MOV R6, R4;                                     /* 0x2800000010019de4 */
    /*00c8*/         SHR.U32 R4, R4, 0x1;                            /* 0x5800c00004411c03 */
    /*00d0*/         ISETP.GE.U32.AND P0, PT, R2, R4, PT;            /* 0x1b0e00001021dc03 */
    /*00d8*/    @!P0 IADD R7, R4, R2;                                /* 0x480000000841e003 */
    /*00e0*/    @!P0 SHL R7, R7, 0x2;                                /* 0x6000c0000871e003 */
    /*00e8*/    @!P0 LDS R7, [R7];                                   /* 0xc10000000071e085 */
    /*00f0*/    @!P0 IADD R5, R7, R5;                                /* 0x4800000014716003 */
    /*00f8*/    @!P0 STS [R3], R5;                                   /* 0xc900000000316085 */
    /*0100*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0108*/         ISETP.GT.U32.AND P0, PT, R6, 0x83, PT;          /* 0x1a0ec0020c61dc03 */
    /*0110*/     @P0 BRA 0xc0;                                       /* 0x4003fffea00001e7 */
    /*0118*/         ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT;          /* 0x1a0ec0007c21dc03 */
    /*0120*/         SSY 0x1f0;                                      /* 0x6000000320000007 */
    /*0128*/     @P0 NOP.S;                                          /* 0x40000000000001f4 */
    /*0130*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0138*/         LDS R4, [R3+0x80];                              /* 0xc100000200311c85 */
    /*0140*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*0148*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*0150*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0158*/         LDS R4, [R3+0x40];                              /* 0xc100000100311c85 */
    /*0160*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*0168*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*0170*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0178*/         LDS R4, [R3+0x20];                              /* 0xc100000080311c85 */
    /*0180*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*0188*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*0190*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0198*/         LDS R4, [R3+0x10];                              /* 0xc100000040311c85 */
    /*01a0*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*01a8*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*01b0*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*01b8*/         LDS R4, [R3+0x8];                               /* 0xc100000020311c85 */
    /*01c0*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*01c8*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*01d0*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*01d8*/         LDS R4, [R3+0x4];                               /* 0xc100000010311c85 */
    /*01e0*/         IADD R4, R5, R4;                                /* 0x4800000010511c03 */
    /*01e8*/         STS.S [R3], R4;                                 /* 0xc900000000311c95 */
    /*01f0*/         ISETP.NE.AND P0, PT, R2, RZ, PT;                /* 0x1a8e0000fc21dc23 */
    /*01f8*/     @P0 BRA.U 0x218;                                    /* 0x40000000600081e7 */
    /*0200*/    @!P0 ISCADD R0, R0, c[0x0][0x24], 0x2;               /* 0x4000400090002043 */
    /*0208*/    @!P0 LDS R2, [RZ];                                   /* 0xc100000003f0a085 */
    /*0210*/    @!P0 ST [R0], R2;                                    /* 0x900000000000a085 */
    /*0218*/         EXIT;                                           /* 0x8000000000001de7 */

正如从 / * 0130 * / - / * 01e8 * / ,现在每次执行求和操作,中间结果立即被刷新到共享内存以获得全线程可见性。

As it can be seen from lines /*0130*/-/*01e8*/, now each time a summation is performed, the intermediate result is immediately flushed to shared memory for full thread visibility.

这篇关于删除CUDA warp级还原中的__syncthreads()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆