Coalesced memory access and global memory load/store efficiency with complex arithmetics in CUDA


Question


I'm profiling the following CUDA kernel:

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);

        data[j*blockDim.x*gridDim.x+i].x *= a;
        data[j*blockDim.x*gridDim.x+i].y *= a;
    }
}


which basically multiplies a 2D double-precision complex data matrix by a scalar double-precision variable.


As can be seen, I'm performing coalesced global memory accesses, and I want to verify this with the NVIDIA Visual Profiler by inspecting the global memory load and store efficiencies. Surprisingly, both efficiencies turn out to be exactly 50%, far from the 100% expected for coalesced memory access. Is this related to the interleaved storage of the real and imaginary parts of the complex numbers? If so, is there any trick I could exploit to restore 100% efficiency?

Thanks.

Additional information

BLOCK_SIZE_x=16
BLOCK_SIZE_y=16

dim3 dimBlock2(BLOCK_SIZE_x,BLOCK_SIZE_y);
dim3 dimGrid2(N2/BLOCK_SIZE_x + (N2%BLOCK_SIZE_x == 0 ? 0:1),N1/BLOCK_SIZE_y + (N1%BLOCK_SIZE_y == 0 ? 0:1));


N1 and N2 can be arbitrary even numbers.


The card is an NVIDIA GT 540M.

Answer


Take a look at this NVIDIA blog post about the efficiency of various memory access patterns. You're hitting the strided memory access problem.


Since each component is used independently, you could treat your double2 array as a plain double array instead (just like Robert Crovella suggested):

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 * 2 && j < N2) {
        double a = pow(-1.0, (i / 2 + j)&1);
        data[j*blockDim.x*gridDim.x+i] *= a;
    }
}


But if you ever need to access both the x and y components in a single thread, you could try:


Using two separate arrays, one for the x components and one for the y components, like this:

__global__ void fftshift_2D(double *dataX, double *dataY, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);

        dataX[j*blockDim.x*gridDim.x+i] *= a;
        dataY[j*blockDim.x*gridDim.x+i] *= a;
    }
}


Or leaving the data layout as is, but loading it without a stride into shared memory and reshuffling it there. That would look more or less like this:

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    __shared__ double buff[BLOCK_SIZE*2];   // BLOCK_SIZE is the x block size (16 here)
    double2 *buff2 = (double2 *) buff;
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    double *ptr = (double *) &data[j*blockDim.x*gridDim.x + blockDim.x * blockIdx.x];

    // TODO add guarding with N1 & N2
    buff[threadIdx.x] = ptr[threadIdx.x];
    buff[blockDim.x + threadIdx.x] = ptr[blockDim.x + threadIdx.x];
    __syncthreads();

    double a = pow(-1.0, (i+j)&1);
    buff2[threadIdx.x].x *= a;
    buff2[threadIdx.x].y *= a;

    __syncthreads();
    ptr[threadIdx.x] = buff[threadIdx.x];
    ptr[blockDim.x + threadIdx.x] = buff[blockDim.x + threadIdx.x];
}

