Initialize device array in CUDA

Question

How do I initialize a device array which is allocated using cudaMalloc()?

I tried cudaMemset, but it fails to initialize to any value other than 0. The code for cudaMemset looks like below, where value is initialized to 5.

cudaMemset(devPtr,value,number_bytes)

Solution

As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:

cudaError_t cudaMemset  (   void *      devPtr,
                            int         value,
                            size_t      count    
                        )           

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

So value is a byte value. If you do something like:

int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);

what you are asking to happen is that each byte of devPtr will be set to 5. If devPtr were an array of integers, the result would be that each integer word would have the value 84215045 (0x05050505). This is probably not what you had in mind.
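
For intuition, the same byte-wise fill can be reproduced on the host with plain memset (a minimal illustration of the byte pattern, not part of the original answer):

#include <cstdio>
#include <cstring>

int main()
{
    int x;
    memset(&x, 5, sizeof(x));       // byte-wise fill, the same thing cudaMemset does on the device
    printf("%d (0x%08x)\n", x, x);  // prints 84215045 (0x05050505)
    return 0;
}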

Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as

template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    // grid-stride loop: each thread writes every (blockDim.x * gridDim.x)-th word,
    // so any launch configuration covers the whole array
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}

(standard disclaimer: written in browser, never compiled, never tested, use at own risk).

Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
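
For example, a launch for an int array might look like this (a sketch under the same disclaimer as above: written in browser, untested; the block and grid sizes are arbitrary choices):

#include <cuda_runtime.h>

int main()
{
    const size_t nwords = 1 << 20;     // number of ints, not bytes
    int *devPtr = 0;
    cudaMalloc((void **)&devPtr, nwords * sizeof(int));

    // the grid-stride loop in initKernel covers any nwords with a fixed launch configuration
    initKernel<int><<<256, 256>>>(devPtr, 5, nwords);
    cudaDeviceSynchronize();

    cudaFree(devPtr);
    return 0;
}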

Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing, but for half-word (16 bit) and full 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
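
A minimal driver API sketch along those lines, assuming cuda.h is available and you link against the driver library (error checking omitted; my own illustration, not part of the original answer):

#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    const size_t nwords = 1 << 20;
    CUdeviceptr devPtr;
    cuMemAlloc(&devPtr, nwords * sizeof(unsigned int));

    // set every 32-bit word to 5 (the driver API analogue of the kernel above)
    cuMemsetD32(devPtr, 5u, nwords);

    cuMemFree(devPtr);
    cuCtxDestroy(ctx);
    return 0;
}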
