Misaligned address in CUDA

Question

Can anyone tell me what's wrong with the following code inside a CUDA kernel:

__constant__ unsigned char MT[256] = {
    0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1, 0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54,....};

typedef unsigned int U32;

__global__ void Kernel (unsigned int  *PT, unsigned int  *CT, unsigned int  *rk)
{

    long int i;
    __shared__ unsigned char sh_MT[256];    

    for (i = 0; i < 64; i += 4)
        ((U32*)sh_MT)[threadIdx.x + i] = ((U32*)MT)[threadIdx.x + i];

    __shared__ unsigned int sh_rkey[4];
    __shared__ unsigned int sh_state_pl[4];
    __shared__ unsigned int sh_state_ct[4];

    sh_state_pl[threadIdx.x] = PT[threadIdx.x];
    sh_rkey[threadIdx.x] = rk[threadIdx.x];
    __syncthreads();


    sh_state_ct[threadIdx.x] = ((U32*)sh_MT)[sh_state_pl[threadIdx.x]]^\
    ((U32*)(sh_MT+3))[((sh_state_pl[(1 + threadIdx.x) % 4] >> 8) & 0xff)] ^ \
    ((U32*)(sh_MT+2))[((sh_state_pl[(2 + threadIdx.x) % 4] >> 16) & 0xff)] ^\
    ((U32*)(sh_MT+1))[((sh_state_pl[(3 + threadIdx.x) % 4] >> 24) & 0xff )];


    CT[threadIdx.x] = sh_state_ct[threadIdx.x];
}

At this line of the code,

((U32*)(sh_MT+3))......

the CUDA debugger gives me the error message: misaligned address

How can I fix this error?

I am using CUDA 7 with MSVC, and I launch the kernel with 1 block and 4 threads, as follows:

__device__ unsigned int *state;
__device__ unsigned int *key;
__device__ unsigned int *ct;
.
.
main()
{
cudaMalloc((void**)&state, 16);
cudaMalloc((void**)&ct, 16);
cudaMalloc((void**)&key, 16);
//cudamemcpy(copy some values to => state , ct, key);   
Kernel<<<1, 4>>>(state, ct, key);
}

Please remember that I can't change the type of my MT table. Thanks in advance for any advice or answers.

Recommended answer

As the error message tells you, the pointer is not aligned to the boundary required by the processor.

From the CUDA Programming Guide, section 5.3.2:


Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

This is what the debugger is trying to tell you: you must not dereference a pointer to a 32-bit value when its address is not aligned to a 32-bit (4-byte) boundary. The quoted passage is about global memory, but the same natural-alignment requirement applies to the shared-memory access in your kernel.

You can do (U32*)(sh_MT) and (U32*)(sh_MT+4) just fine, but not (U32*)(sh_MT+3) or the like.
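
To make the rule concrete, here is a minimal sketch of an alignment check. The helper name is_aligned4 and the HD macro are hypothetical (they are not part of the original post or of the CUDA API), and the sketch assumes the base of the array is itself 4-byte aligned. It is written so that it compiles both under nvcc and as plain C++.

```cpp
#include <cstdint>

// Assumption: HD is a local convenience macro so the same code builds with
// nvcc (host + device) and with an ordinary C++ compiler (host only).
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: returns true when p may be dereferenced as a
// 4-byte word, i.e. when the address is a multiple of 4.
HD inline bool is_aligned4(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 3u) == 0;
}
```

With a 4-byte-aligned sh_MT, is_aligned4(sh_MT) and is_aligned4(sh_MT + 4) hold, while is_aligned4(sh_MT + 1), is_aligned4(sh_MT + 2) and is_aligned4(sh_MT + 3) do not, which matches the three casts that fail in the kernel.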

You probably have to read the bytes separately and join them together.
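
One way to do that, as a rough sketch rather than a definitive fix: a small helper that issues four single-byte loads and combines them into one 32-bit value. The name load_u32_unaligned and the HD macro are hypothetical. The bytes are joined least significant first, which is assumed to match what the original U32 cast would have produced on the little-endian GPU.

```cpp
#include <cstdint>

// Assumption: HD is a local convenience macro so the same code builds with
// nvcc (host + device) and with an ordinary C++ compiler (host only).
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: build a 32-bit value from four consecutive bytes,
// least significant byte first. Only byte-sized loads are issued, so p
// may have any alignment.
HD inline std::uint32_t load_u32_unaligned(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

In the kernel, an expression such as ((U32*)(sh_MT+3))[idx] would then become load_u32_unaligned(sh_MT + 3 + 4 * idx), because indexing a U32 pointer advances 4 bytes per element. The MT table keeps its unsigned char type, at the cost of four byte loads per lookup instead of one word load.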
