Cast char* to unsigned int* in CUDA


Question

I have a type defined as typedef unsigned char uint40[5], and I have an array of uint40, say uint40* payloads.

I was trying to port the following function into a CUDA kernel:

void aSimpleFunction(int M, uint40* data)
{
    for (auto i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        *dataPtr = 2158677232;
        data[i][4] = 1;
    }
}

To me the port looked as simple as the kernel below, but it didn't work. However, accessing each element with square brackets does work.

__global__ void aSimpleKernel(int M, uint40* data)
{
    int tid = threadIdx.x + 1;

    // DOESN'T WORK
    unsigned int* dataPtr = (unsigned int*)data[tid];
    *dataPtr = 16976944;
    // WORKS
    /*
    data[threadIdx.x][0] = tid * 1;
    data[threadIdx.x][1] = tid * 2;
    data[threadIdx.x][2] = tid * 3;
    data[threadIdx.x][3] = tid * 4;
    */
    data[threadIdx.x][4] = 2;
}

Is it possible to cast a char* to an unsigned int* in a CUDA kernel?

By "didn't work" I mean that the uint40* array contains random numbers instead of the values I expect when I print each element. Sometimes the GPU apparently crashes, since a Windows pop-up tells me the GPU restarted successfully.

Solution

Any time you're having trouble with CUDA code, it's a good idea to use proper CUDA error checking and to run your code with cuda-memcheck. Even if you don't understand the error output, it will be useful for those trying to help you, so I suggest doing that before asking for help here.

My attempt to make a complete code out of what you have shown was like this:

#include <stdio.h>
#include <stdlib.h>   // malloc
#include <string.h>   // memset

typedef unsigned char uint40[5];


void aSimpleFunction(int M, uint40* data)
{
    for (int i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        *dataPtr = 0x02020202U;
        data[i][4] = 1;
    }
}

void uint40_print(uint40 &data){

  char *my_data = (char *)&data;
  for (int i = 0; i < 5; i++) printf("%d", my_data[i]);
  printf("\n");
}

__global__ void aSimpleKernel(int M, uint40* data)
{
    for (int i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        printf("%p\n", dataPtr);
        *dataPtr = 0x02020202U;
        data[i][4] = 1;
    }
}

int main(){

  uint40 *payloads = (uint40 *)malloc(10000);
  memset(payloads, 0, 10000);
  aSimpleFunction(5, payloads);
  uint40_print(payloads[0]);
  memset(payloads, 0, 10000);
  uint40 *d_payloads;
  cudaMalloc(&d_payloads, 10000);
  aSimpleKernel<<<1,1>>>(5, d_payloads);
  cudaMemcpy(payloads, d_payloads, 10000, cudaMemcpyDeviceToHost);
  for (int i = 0; i < 5; i++) uint40_print(payloads[i]);
  return 0;
}

When I compile and run that code I get output like this:

$ ./t1091
22221
00000
$

Sure enough, the GPU output doesn't match the CPU output. If I run the code with cuda-memcheck, a portion of the output I get looks like this:

$ cuda-memcheck ./t1091
========= CUDA-MEMCHECK
22221
========= Invalid __global__ write of size 4
=========     at 0x00000080 in /home/bob/misc/t1091.cu:28:aSimpleKernel(int, unsigned char[5]*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x402500005 is misaligned

And this gives a clue to the actual problem. In effect you are creating a char array and then superimposing a 5-byte-wide structure (uint40) on it. This means that consecutive uint40 items will start at byte addresses that differ by 5.

When you take one of these addresses and cast it to an int or unsigned int pointer, you may end up with a misaligned pointer. CUDA requires all accesses of POD data types to occur on naturally aligned boundaries. So a 32-bit quantity (e.g. int, float, etc.) must be accessed on a 4-byte boundary (0, 4, 8, ...). Many of the 5-byte boundaries for uint40 (0, 5, 10, ...) don't also fall on 4-byte boundaries, so attempting to access a 4-byte quantity that way is illegal.
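
To make the boundary arithmetic concrete, here is a small host-side sketch in plain C++ (no GPU needed). The helper names is_aligned, element_is_word_aligned and count_word_aligned are hypothetical and are not part of CUDA or of the code in the question. The sketch assumes the array base itself is at least 4-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element size from the question: typedef unsigned char uint40[5];
const std::size_t kElemSize = 5;

// True if `address` is a multiple of `alignment` (which must be a power of two).
inline bool is_aligned(std::uint64_t address, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}

// True if element i of a uint40 array starts on a 4-byte boundary,
// assuming the array base itself is at least 4-byte aligned.
inline bool element_is_word_aligned(std::size_t i) {
    return is_aligned(i * kElemSize, sizeof(std::uint32_t));
}

// Counts how many of the first n elements start on a 4-byte boundary.
inline std::size_t count_word_aligned(std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (element_is_word_aligned(i)) ++count;
    }
    return count;
}
```

With these helpers, element_is_word_aligned(1) is false (offset 5), which is consistent with the misaligned address ending in 5 in the cuda-memcheck report. Only elements 0, 4, 8, ... pass the check, so three out of every four casts produce a misaligned pointer.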

One possible solution, for this particular usage example, and assuming the pointer you pass to the kernel is a pointer that is returned by cudaMalloc (for alignment), is just to change your typedef:

typedef unsigned char uint40[8];

This forces every uint40 item to fall on an 8-byte boundary, which is also a 4-byte boundary. A side effect of this would be allocating 3 unused bytes out of every 8 allocated.
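
The cost of the padded layout can be checked with a short host-side sketch. The names uint40_padded, element_offset and padding_bytes are made up for illustration; only the 8-byte typedef itself comes from the answer.

```cpp
#include <cassert>
#include <cstddef>

// Padded variant of the typedef: 5 payload bytes plus 3 unused bytes.
typedef unsigned char uint40_padded[8];

const std::size_t kPayloadBytes = 5;

// Byte offset of element i from the start of the array.
inline std::size_t element_offset(std::size_t i) {
    return i * sizeof(uint40_padded);
}

// Bytes left unused by padding across n elements.
inline std::size_t padding_bytes(std::size_t n) {
    assert(sizeof(uint40_padded) >= kPayloadBytes);
    return n * (sizeof(uint40_padded) - kPayloadBytes);
}
```

Every value returned by element_offset is a multiple of 8, and therefore of 4, so the cast is aligned whenever the base pointer is. For 1000 elements, padding_bytes reports 3000 unused bytes out of 8000 allocated.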

In your case, you indicated that the uint40 type was a collection of data, not a single numerical quantity, so it is effectively a data "structure" that happens to occupy 5 bytes per element. An array of such "structures" would effectively be AoS (array of structures) storage format, and a common transformation on such data for performance is to convert it to an SoA (structure of arrays) storage format. Therefore another possible approach would be to create two arrays:

typedef unsigned char uint40a[4];
typedef unsigned char uint40b[1];
uint40a *data1;
uint40b *data2;
cudaMalloc(&data1, size);
cudaMalloc(&data2, size);

and access your data in this fashion. This will maintain the storage density and almost certainly provide faster access to your data on the GPU as compared to your 5-byte structure.
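
As a rough host-side illustration of the two-array layout, the sketch below uses a hypothetical Uint40SoA struct backed by std::vector. On the device, the two vectors would correspond to the two cudaMalloc allocations above. Storing the first part as a 32-bit integer rather than unsigned char[4] is an assumption made here so that the 4-byte store needs no cast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical SoA container: one array for bytes 0..3, one for byte 4.
struct Uint40SoA {
    std::vector<std::uint32_t> low;    // bytes 0..3 of each element
    std::vector<unsigned char> high;   // byte 4 of each element

    explicit Uint40SoA(std::size_t n) : low(n, 0), high(n, 0) {}

    // Writes both parts of element i; the 32-bit store is naturally aligned
    // because `low` is an array of 32-bit integers.
    void set(std::size_t i, std::uint32_t low_value, unsigned char high_value) {
        assert(i < low.size());
        low[i] = low_value;
        high[i] = high_value;
    }

    // Total payload storage in bytes: still 5 bytes per element.
    std::size_t storage_bytes() const {
        return low.size() * sizeof(std::uint32_t) + high.size();
    }
};
```

For five elements, storage_bytes returns 25, the same density as the original 5-byte layout, while every 32-bit access stays on a 4-byte boundary.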

If there is any doubt from the above, you cannot pick up an arbitrary char pointer, cast it to another (larger) datatype, and expect good things to happen. The pointers you use must be properly aligned for the datatype being referenced.
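
If the 5-byte layout has to stay as it is, the byte-wise access that already worked in the question can be wrapped in a helper. The following is only a sketch: store_u32_bytewise, load_u32_bytewise and the HOST_DEVICE macro are hypothetical names. The macro is intended to expand to the CUDA qualifiers when the file is compiled with nvcc and to nothing otherwise, so the same code can be tried on the host.

```cpp
#include <cassert>
#include <cstdint>

// Expands to CUDA qualifiers under nvcc, and to nothing for a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Writes a 32-bit value one byte at a time, least significant byte first
// (the byte order a direct store would produce on a little-endian target).
// Works for any destination address because only single bytes are accessed.
HOST_DEVICE inline void store_u32_bytewise(unsigned char* dst, std::uint32_t value) {
    assert(dst);
    for (int b = 0; b < 4; ++b) {
        dst[b] = static_cast<unsigned char>(value >> (8 * b));
    }
}

// Reads the value back using the same byte order.
HOST_DEVICE inline std::uint32_t load_u32_bytewise(const unsigned char* src) {
    assert(src);
    std::uint32_t value = 0;
    for (int b = 0; b < 4; ++b) {
        value |= static_cast<std::uint32_t>(src[b]) << (8 * b);
    }
    return value;
}
```

In the kernel from the question, a call such as store_u32_bytewise(data[i], 0x02020202U) would take the place of the cast and the 4-byte store. This trades one 32-bit access for four 8-bit accesses, so it avoids the alignment fault but is likely slower than either layout change described above.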
