Initialize device array in CUDA
Problem description
How do I initialize a device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize values other than 0. The code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr, value, number_bytes);
As you are discovering, cudaMemset
works like the C standard library memset
. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr, int value, size_t count )
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
So value
is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking to happen is that each byte of devPtr
will be set to 5. If devPtr
was an array of integers, the result would be that each integer word would have the value 84215045 (that is, 0x05050505). This is probably not what you had in mind.
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    // Grid-stride loop: each thread writes every (blockDim.x * gridDim.x)-th
    // word, so any nwords is covered regardless of the launch configuration.
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;
    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset
. This isn't really any different from what cudaMemset
does anyway: using that API call results in a kernel launch which is not too different from what I posted above.
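Putting it together, a usage sketch that instantiates the template for float and fills a million words with 5.0f might look like this (same disclaimer as above: written in a browser, untested; the kernel is repeated here so the snippet is self-contained):

```cuda
#include <cstdio>

template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;
    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}

int main()
{
    const size_t nwords = 1 << 20;           // number of floats, NOT bytes
    float *devPtr = nullptr;
    cudaMalloc((void **)&devPtr, nwords * sizeof(float));

    // 256 blocks of 256 threads; the grid-stride loop covers any size.
    initKernel<float><<<256, 256>>>(devPtr, 5.0f, nwords);
    cudaDeviceSynchronize();

    float first = 0.0f;
    cudaMemcpy(&first, devPtr, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", first);                   // expect 5.0

    cudaFree(devPtr);
    return 0;
}
```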
Alternatively, if you can use the driver API, there are cuMemsetD16
and cuMemsetD32
, which do the same thing, but for 16-bit and 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
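The driver API route might be sketched as follows (untested here, and assuming a single device with device 0 used for the context); note that cuMemsetD32 takes a full 32-bit value and a word count, so every word becomes 5 rather than every byte:

```cuda
#include <cuda.h>
#include <cstdio>

int main()
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    const size_t nwords = 1024;              // word count, NOT bytes
    CUdeviceptr dptr;
    cuMemAlloc(&dptr, nwords * sizeof(unsigned int));

    // Unlike cudaMemset, the value is a full 32-bit word: every
    // word of the allocation becomes 5, not 0x05050505.
    cuMemsetD32(dptr, 5, nwords);

    unsigned int first = 0;
    cuMemcpyDtoH(&first, dptr, sizeof(first));
    printf("%u\n", first);                   // expect 5

    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}
```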