memset in CUDA that allows setting values within a kernel


Question

I am making several cudaMemset calls in order to set my values to 0, as shown below:

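// N (the number of allocations) is assumed to be defined elsewhere.
void allocateByte(char **gStoreR, const int byte) {
    // Host-side staging array for the N device pointers.
    char **cStoreR = (char **)malloc(N * sizeof(char *));

    for (int i = 0; i < N; i++) {
        char *c;
        cudaMalloc((void **)&c, byte * sizeof(char));
        cudaMemset(c, 0, byte);  // one memset call per allocation
        cStoreR[i] = c;
    }
    // Copy the table of device pointers to the device.
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
    free(cStoreR);  // the host staging array is no longer needed
}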

However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?

Recommended answer

Every cudaMemset call launches a kernel, so if N is large and byte is small, kernel launch overhead will dominate and slow the code down. There is no device-side memset, so the solution is to write a kernel that traverses the allocations and zeros the storage in a single launch.
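A minimal sketch of that idea follows; it is not the answerer's code. The kernel name zeroAllocations, the explicit n parameter (standing in for the question's global N), the error-code return value, and the launch configuration of up to 1024 blocks of 256 threads are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical kernel: each block walks over whole allocations, and the
// threads of that block stride over the bytes of the current allocation.
__global__ void zeroAllocations(char **store, int n, int bytes)
{
    for (int a = blockIdx.x; a < n; a += gridDim.x) {
        char *p = store[a];
        for (int i = threadIdx.x; i < bytes; i += blockDim.x)
            p[i] = 0;
    }
}

// gStoreR must already point to device memory with room for n pointers.
// Note: on failure this sketch does not release earlier allocations.
cudaError_t allocateByte(char **gStoreR, int n, int bytes)
{
    if (n <= 0) return cudaSuccess;

    char **cStoreR = (char **)malloc(n * sizeof(char *));
    if (!cStoreR) return cudaErrorMemoryAllocation;

    cudaError_t err = cudaSuccess;
    for (int i = 0; i < n && err == cudaSuccess; i++)
        err = cudaMalloc((void **)&cStoreR[i], bytes);

    if (err == cudaSuccess)
        err = cudaMemcpy(gStoreR, cStoreR, n * sizeof(char *),
                         cudaMemcpyHostToDevice);
    free(cStoreR);
    if (err != cudaSuccess) return err;

    // A single launch zeros every allocation (launch sizes are arbitrary).
    int blocks = n < 1024 ? n : 1024;
    zeroAllocations<<<blocks, 256>>>(gStoreR, n, bytes);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}
```

The per-allocation cudaMalloc calls remain, so this only removes the N separate memset launches; the allocation cost itself is unchanged.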

As an aside, I would strongly recommend against using a structure of arrays of this kind (here, an array of pointers to separately allocated buffers) in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome with a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
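The linear-memory layout described above might look like the following sketch. The names allocateLinear, flatIndex, and fillRow are hypothetical, and the row-major layout (logical array i occupying bytes i*bytes through (i+1)*bytes - 1) is an assumption chosen for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Allocate n logical arrays of `bytes` bytes each as one contiguous block
// and zero all of it with one cudaMalloc and one cudaMemset.
cudaError_t allocateLinear(char **gStore, int n, int bytes)
{
    size_t total = (size_t)n * bytes;
    cudaError_t err = cudaMalloc((void **)gStore, total);
    if (err != cudaSuccess) return err;
    return cudaMemset(*gStore, 0, total);
}

// Offset of byte j of logical array i: integer arithmetic replaces the
// pointer lookup that store[i][j] would need.
__host__ __device__ inline size_t flatIndex(int i, int j, int bytes)
{
    return (size_t)i * bytes + j;
}

// Example device-side access: set every byte of logical array `row`.
__global__ void fillRow(char *store, int row, int bytes, char value)
{
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < bytes;
         j += gridDim.x * blockDim.x)
        store[flatIndex(row, j, bytes)] = value;
}
```

A kernel then receives the single base pointer plus bytes, for example fillRow<<<1, 128>>>(store, 2, bytes, 7), and no table of device pointers has to be built or copied.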
