Is there an equivalent to memcpy() that works inside a CUDA kernel?
Question
I'm trying to break apart and reshape the structure of an array asynchronously using the CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy()*; I'm at a loss.
Can anyone tell me the preferred method for copying memory from within the CUDA kernel?
*It is worth noting that cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice) will NOT work for what I am trying to do, because it can only be called from outside of the kernel and does not execute asynchronously.
Answer
Yes, there is an equivalent to memcpy that works inside CUDA kernels. It is called memcpy. As an example:
__global__ void kernel(int **in, int **out, int len, int N)
{
    int idx = threadIdx.x + blockIdx.x*blockDim.x;

    for(; idx<N; idx+=gridDim.x*blockDim.x)
        memcpy(out[idx], in[idx], sizeof(int)*len);
}
This compiles without error:
$ nvcc -Xptxas="-v" -arch=sm_20 -c memcpy.cu
ptxas info : Compiling entry function '_Z6kernelPPiS0_ii' for 'sm_20'
ptxas info : Function properties for _Z6kernelPPiS0_ii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 11 registers, 48 bytes cmem[0]
and emits PTX:
.version 3.0
.target sm_20
.address_size 32
.file 1 "/tmp/tmpxft_00000407_00000000-9_memcpy.cpp3.i"
.file 2 "memcpy.cu"
.file 3 "/usr/local/cuda/nvvm/ci_include.h"
.entry _Z6kernelPPiS0_ii(
.param .u32 _Z6kernelPPiS0_ii_param_0,
.param .u32 _Z6kernelPPiS0_ii_param_1,
.param .u32 _Z6kernelPPiS0_ii_param_2,
.param .u32 _Z6kernelPPiS0_ii_param_3
)
{
.reg .pred %p<4>;
.reg .s32 %r<32>;
.reg .s16 %rc<2>;
ld.param.u32 %r15, [_Z6kernelPPiS0_ii_param_0];
ld.param.u32 %r16, [_Z6kernelPPiS0_ii_param_1];
ld.param.u32 %r2, [_Z6kernelPPiS0_ii_param_3];
cvta.to.global.u32 %r3, %r15;
cvta.to.global.u32 %r4, %r16;
.loc 2 4 1
mov.u32 %r5, %ntid.x;
mov.u32 %r17, %ctaid.x;
mov.u32 %r18, %tid.x;
mad.lo.s32 %r30, %r5, %r17, %r18;
.loc 2 6 1
setp.ge.s32 %p1, %r30, %r2;
@%p1 bra BB0_5;
ld.param.u32 %r26, [_Z6kernelPPiS0_ii_param_2];
shl.b32 %r7, %r26, 2;
.loc 2 6 54
mov.u32 %r19, %nctaid.x;
.loc 2 4 1
mov.u32 %r29, %ntid.x;
.loc 2 6 54
mul.lo.s32 %r8, %r29, %r19;
BB0_2:
.loc 2 7 1
shl.b32 %r21, %r30, 2;
add.s32 %r22, %r4, %r21;
ld.global.u32 %r11, [%r22];
add.s32 %r23, %r3, %r21;
ld.global.u32 %r10, [%r23];
mov.u32 %r31, 0;
BB0_3:
add.s32 %r24, %r10, %r31;
ld.u8 %rc1, [%r24];
add.s32 %r25, %r11, %r31;
st.u8 [%r25], %rc1;
add.s32 %r31, %r31, 1;
setp.lt.u32 %p2, %r31, %r7;
@%p2 bra BB0_3;
.loc 2 6 54
add.s32 %r30, %r8, %r30;
ld.param.u32 %r27, [_Z6kernelPPiS0_ii_param_3];
.loc 2 6 1
setp.lt.s32 %p3, %r30, %r27;
@%p3 bra BB0_2;
BB0_5:
.loc 2 9 2
ret;
}
The code block at BB0_3 is a byte-sized memcpy loop emitted automagically by the compiler. It might not be a great idea from a performance point of view to use it, but it is fully supported (and has been for a long time on all architectures).
Edited four years later to add that, since the device-side runtime API was released as part of the CUDA 6 release cycle, it is also possible to directly call something like cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice) in device code for all architectures which support it (Compute Capability 3.5 and newer hardware).
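As a hedged sketch of that device-side runtime call (the kernel name and sizes are my own; it assumes Compute Capability 3.5+ and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true copy.cu -lcudadevrt):

```cuda
// Illustrative only: an in-kernel copy via the device runtime API.
__global__ void copy_kernel(int *out, const int *in, size_t len)
{
    // Issue the copy once from a single thread; the device runtime
    // enqueues it rather than copying inline in this thread.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        cudaMemcpyAsync(out, in, len * sizeof(int),
                        cudaMemcpyDeviceToDevice);
    }
}
```

Note the async variant is what the device runtime favours; whether this beats the plain in-kernel memcpy loop depends on copy size and how many threads would otherwise participate.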