分析我的CUDA内核的内存访问合并 [英] Analyzing memory access coalescing of my CUDA kernel

查看:179
本文介绍了分析我的CUDA内核的内存访问合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想通过BS_x * BS_Y线程将内容移动到共享内存中读取(BS_X + 1)*(BS_Y + 1)全局内存位置,并已开发了以下代码。

I would like to read (BS_X+1)*(BS_Y+1) global memory locations by BS_x*BS_Y threads moving the contents to the shared memory and I have developed the following code.

int i       = threadIdx.x;
int j       = threadIdx.y;
int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1  = j*BLOCK_SIZE_Y+i;

int i1      = (index1)%(BLOCK_SIZE_X+1);
int j1      = (index1)/(BLOCK_SIZE_Y+1);

int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];     

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1))) 
Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];



在我的理解中,合并是顺序处理的连续内存读取的并行等效。如果全局内存访问合并,如何检测?我注意到有从(i1,j1)到(i2,j2)的索引跳跃。
提前感谢。

In my understanding, coalescing is the parallel equivalent of consecutive memory reads of sequential processing. How can I detect now if the global memory accesses are coalesced? I remark that there is an index jump from (i1,j1) to (i2,j2). Thanks in advance.

推荐答案

我已经用手写的合并分析器评估了代码的内存访问。评估显示代码较少利用合并。这里是您可能会发现有用的合并分析器:

I've evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows the code less exploits the coalescing. Here is the coalescing analyzer that you may find useful:

#include <stdio.h>
#include <malloc.h>

typedef struct dim3_t{
    int x;
    int y;
} dim3;


// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16


// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size)  (addr*size)&(~(CACHEBLOCKSIZE-1))


int main(){
    // fixed dim3 variables
    // grid and block size
    dim3 blockDim,gridDim;
    blockDim.x=BLOCKDIMX;
    blockDim.y=BLOCKDIMY;
    gridDim.x=GRIDDIMX;
    gridDim.y=GRIDDIMY;

    // counters
    int unq_accesses=0;
    int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
    int total_unq_accesses=0;

    // iter over total number of threads
    // and count the number of memory requests (the coalesced requests)
    int I, II, III;
    for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
        dim3 blockIdx;
        blockIdx.x = I%GRIDDIMX;
        blockIdx.y = I/GRIDDIMX;
        for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
            if(II%COALESCINGWIDTH==0){
                // new coalescing bunch
                total_unq_accesses+=unq_accesses;
                unq_accesses=0;
            }
            dim3 threadIdx;
            threadIdx.x=II%BLOCKDIMX;
            threadIdx.y=II/BLOCKDIMX;

            ////////////////////////////////////////////////////////
            // Change this section to evaluate different accesses //
            ////////////////////////////////////////////////////////
            // do your indexing here
            #define BLOCK_SIZE_X BLOCKDIMX
            #define BLOCK_SIZE_Y BLOCKDIMY
            #define xdim 32
            int i       = threadIdx.x;
            int j       = threadIdx.y;
            int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
            int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

            int index1  = j*BLOCK_SIZE_Y+i;

            int i1      = (index1)%(BLOCK_SIZE_X+1);
            int j1      = (index1)/(BLOCK_SIZE_Y+1);

            int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
            int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
            // calculate the accessed location and offset here
            // change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
            int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
            int size = sizeof(double);
            //////////////////////////
            // End of modifications //
            //////////////////////////

            printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
            // check whether it can be merged with existing requests or not
            short merged=0;
            for(III=0; III<unq_accesses; III++){
                if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
                    merged=1;
                    break;
                }
            }
            if(!merged){
                // new cache block accessed over this coalescing width
                unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
                unq_accesses++;
            }
        }
    }
    printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}

代码将为网格的每个线程运行,并计算合并数请求,内存访问合并的度量。

The code will run for every thread of the grid and calculates the number of merged requests, metric of memory access coalescing.

要使用分析器,请将代码的索引计算部分粘贴到指定区域,并将内存访问(数组)地址和大小。我已经为你的代码执行这里的索引是:

To use the analyzer, paste the index calculation portion of your code in the specified region and decompose the memory accesses (array) into 'address' and 'size'. I've already done this for your code where the indexings are:

int i       = threadIdx.x;
int j       = threadIdx.y;
int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1  = j*BLOCK_SIZE_Y+i;

int i1      = (index1)%(BLOCK_SIZE_X+1);
int j1      = (index1)/(BLOCK_SIZE_Y+1);

int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

,内存访问为:

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

分析器报告4096个线程访问4064个高速缓存块。运行实际网格和块大小的代码,并分析合并行为。

The analyzer reports 4096 threads access to 4064 cache blocks. Run the code for your actual grid and block size and analyze the coalescing behavior.

这篇关于分析我的CUDA内核的内存访问合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆