CUDA In-place Transpose Error

Views: 217 | Published: 2017/3/4 15:14:04 | Tags: cuda, transpose, in-place

Problem description

I'm implementing a CUDA program for transposing an image. I created 2 kernels. The first kernel does out-of-place transposition and works perfectly for any image size.

Then I created a kernel for in-place transposition of square images. However, the output is incorrect. The lower triangle of the image is transposed but the upper triangle remains the same. The resulting image has a staircase-like pattern along the diagonal, and the size of each step is equal to the 2D block size I used for the kernel.

Out-of-place kernel (works perfectly for any image size if src and dst are different):

template<typename T, int blockSize>
__global__ void kernel_transpose(T* src, T* dst, int width, int height, int srcPitch, int dstPitch)
{
    __shared__ T block[blockSize][blockSize];

    int col = blockIdx.x * blockSize + threadIdx.x;
    int row = blockIdx.y * blockSize + threadIdx.y;

    if((col < width) && (row < height))
    {
        int tid_in = row * srcPitch + col;
        block[threadIdx.y][threadIdx.x] = src[tid_in];
    }

    __syncthreads();

    col = blockIdx.y * blockSize + threadIdx.x;
    row = blockIdx.x * blockSize + threadIdx.y;

    if((col < height) && (row < width))
    {
        int tid_out = row * dstPitch + col;
        dst[tid_out] = block[threadIdx.x][threadIdx.y];
    }
}

In-place kernel:

template<typename T, int blockSize>
__global__ void kernel_transpose_inplace(T* srcDst, int width, int pitch)
{
    __shared__ T block[blockSize][blockSize];

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    int tid_in = row * pitch + col;
    int tid_out = col * pitch + row;

    if((row < width) && (col < width))
        block[threadIdx.x][threadIdx.y] = srcDst[tid_in];

    __threadfence();

    if((row < width) && (col < width))
        srcDst[tid_out] = block[threadIdx.x][threadIdx.y];
}

Wrapper function:

int transpose_8u_c1(unsigned char* pSrcDst, int width,int pitch)
{
    //pSrcDst is allocated using cudaMallocPitch

    dim3 block(16,16);
    dim3 grid;
    grid.x = (width + block.x - 1)/block.x;
    grid.y = (width + block.y - 1)/block.y;

    kernel_transpose_inplace<unsigned char,16><<<grid,block>>>(pSrcDst,width,pitch);

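    // Call outside assert(): with NDEBUG defined, assert's argument is not evaluated.
    cudaError_t err = cudaDeviceSynchronize();
    assert(cudaSuccess == err);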

    return 1;
}

Sample input and wrong output: [images not preserved]

I know this problem has something to do with the logic of the in-place transpose, because my out-of-place kernel, which works perfectly for different source and destination buffers, gives the same wrong result if I pass it a single pointer for both source and destination.

What am I doing wrong? Please help me correct the in-place kernel.

Recommended answer

Your in-place kernel overwrites data in the image that another thread will subsequently pick up for its own transpose operation. Neither __threadfence() nor __syncthreads() prevents this, because neither makes one block wait for another block's reads. So for a square image, each thread should buffer the destination element before overwriting it, then write that buffered value to its proper transposed location. Since each thread now effectively performs 2 copies, only half as many threads need to do any work. Something like this should work:

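template<typename T, int blockSize>
__global__ void kernel_transpose_inplace(T* srcDst, int width, int pitch)
{
    // blockSize is unused here; it is kept so the existing launch code still compiles.
    // pitch is used as a row stride in elements (equal to the byte pitch for unsigned char).
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    int tid_in = row * pitch + col;   // element (row, col)
    int tid_out = col * pitch + row;  // its mirror element (col, row)

    // Only threads strictly above the diagonal swap, so each pair is touched exactly once.
    if((row < width) && (col < width) && (row < col))
    {
        T temp = srcDst[tid_out];

        srcDst[tid_out] = srcDst[tid_in];
        srcDst[tid_in] = temp;
    }
}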
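
To make the overwrite described above concrete, here is a minimal host-side sketch in plain C++ (no GPU required). The function names `simulate_naive_inplace` and `simulate_swap_inplace` are hypothetical and do not appear in the original post. The first function models the original in-place kernel under a strong assumption: blocks run one at a time in row-major order, and within a block every read finishes before any write. Real GPU scheduling is not specified, so this is only one ordering that happens to reproduce the reported symptom. The second function models the swap-based kernel, where each pair of mirrored elements is touched by exactly one thread, so the result should not depend on execution order.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model of the original in-place kernel (not from the post).
// Assumption: blocks execute sequentially in row-major order, and inside a
// block all threads read before any thread writes.
std::vector<int> simulate_naive_inplace(std::vector<int> img, int width, int blockSize)
{
    assert(static_cast<int>(img.size()) == width * width);
    std::vector<int> tile(blockSize * blockSize);

    for (int by = 0; by * blockSize < width; ++by) {
        for (int bx = 0; bx * blockSize < width; ++bx) {
            // Read phase: each "thread" loads its own element.
            for (int ty = 0; ty < blockSize; ++ty) {
                for (int tx = 0; tx < blockSize; ++tx) {
                    int col = bx * blockSize + tx;
                    int row = by * blockSize + ty;
                    if (row < width && col < width)
                        tile[ty * blockSize + tx] = img[row * width + col];
                }
            }
            // Write phase: each "thread" stores to the transposed position,
            // possibly clobbering data a later block has not read yet.
            for (int ty = 0; ty < blockSize; ++ty) {
                for (int tx = 0; tx < blockSize; ++tx) {
                    int col = bx * blockSize + tx;
                    int row = by * blockSize + ty;
                    if (row < width && col < width)
                        img[col * width + row] = tile[ty * blockSize + tx];
                }
            }
        }
    }
    return img;
}

// Hypothetical model of the swap-based kernel: only positions with row < col
// do any work, and each swaps with its mirror element.
std::vector<int> simulate_swap_inplace(std::vector<int> img, int width)
{
    assert(static_cast<int>(img.size()) == width * width);
    for (int row = 0; row < width; ++row)
        for (int col = row + 1; col < width; ++col)
            std::swap(img[row * width + col], img[col * width + row]);
    return img;
}
```

For a 4x4 image with 2x2 blocks under the assumed ordering, the naive model leaves the upper-right tile unchanged while the lower-left and diagonal tiles come out transposed, which matches the block-sized staircase described in the question. The swap model can also serve as a CPU reference when checking the GPU output.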
