有一个等价于memcpy()在CUDA内核中工作吗? [英] Is there an equivalent to memcpy() that works inside a CUDA kernel?

查看:138
本文介绍了有一个等价于memcpy()在CUDA内核中工作吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用CUDA内核异步地拆分和重塑数组的结构。 memcpy()在内核中不工作, cudaMemcpy() *;

I'm trying to break apart and reshape the structure of an array asynchronously using the CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy()*; I'm at a loss.

任何人都可以告诉我从CUDA内核拷贝内存的首选方法?

Can anyone tell me the preferred method for copying memory from within the CUDA kernel?

值得注意的是, cudaMemcpy(void * to,void * from,size,cudaMemcpyDeviceToDevice)将无法正常工作,它只能从内核外部调用,并且不会异步执行。

It is worth noting, cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice) will NOT work for what I am trying to do, because it can only be called from outside of the kernel and does not execute asynchronously.

推荐答案

相当于在cuda内核中工作的 memcpy 。它被称为 memcpy 。例如:

Yes, there is an equivalent to memcpy that works inside cuda kernels. It is called memcpy. As an example:

__global__ void kernel(int **in, int **out, int len, int N)
{
    int idx = threadIdx.x + blockIdx.x*blockDim.x;

    for(; idx<N; idx+=gridDim.x*blockDim.x)
        memcpy(out[idx], in[idx], sizeof(int)*len);

}

这样无错编译:

$ nvcc -Xptxas="-v" -arch=sm_20 -c memcpy.cu 
ptxas info    : Compiling entry function '_Z6kernelPPiS0_ii' for 'sm_20'
ptxas info    : Function properties for _Z6kernelPPiS0_ii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 11 registers, 48 bytes cmem[0]

并发出PTX:

.version 3.0
.target sm_20
.address_size 32

    .file   1 "/tmp/tmpxft_00000407_00000000-9_memcpy.cpp3.i"
    .file   2 "memcpy.cu"
    .file   3 "/usr/local/cuda/nvvm/ci_include.h"

.entry _Z6kernelPPiS0_ii(
    .param .u32 _Z6kernelPPiS0_ii_param_0,
    .param .u32 _Z6kernelPPiS0_ii_param_1,
    .param .u32 _Z6kernelPPiS0_ii_param_2,
    .param .u32 _Z6kernelPPiS0_ii_param_3
)
{
    .reg .pred  %p<4>;
    .reg .s32   %r<32>;
    .reg .s16   %rc<2>;


    ld.param.u32    %r15, [_Z6kernelPPiS0_ii_param_0];
    ld.param.u32    %r16, [_Z6kernelPPiS0_ii_param_1];
    ld.param.u32    %r2, [_Z6kernelPPiS0_ii_param_3];
    cvta.to.global.u32  %r3, %r15;
    cvta.to.global.u32  %r4, %r16;
    .loc 2 4 1
    mov.u32     %r5, %ntid.x;
    mov.u32     %r17, %ctaid.x;
    mov.u32     %r18, %tid.x;
    mad.lo.s32  %r30, %r5, %r17, %r18;
    .loc 2 6 1
    setp.ge.s32     %p1, %r30, %r2;
    @%p1 bra    BB0_5;

    ld.param.u32    %r26, [_Z6kernelPPiS0_ii_param_2];
    shl.b32     %r7, %r26, 2;
    .loc 2 6 54
    mov.u32     %r19, %nctaid.x;
    .loc 2 4 1
    mov.u32     %r29, %ntid.x;
    .loc 2 6 54
    mul.lo.s32  %r8, %r29, %r19;

BB0_2:
    .loc 2 7 1
    shl.b32     %r21, %r30, 2;
    add.s32     %r22, %r4, %r21;
    ld.global.u32   %r11, [%r22];
    add.s32     %r23, %r3, %r21;
    ld.global.u32   %r10, [%r23];
    mov.u32     %r31, 0;

BB0_3:
    add.s32     %r24, %r10, %r31;
    ld.u8   %rc1, [%r24];
    add.s32     %r25, %r11, %r31;
    st.u8   [%r25], %rc1;
    add.s32     %r31, %r31, 1;
    setp.lt.u32     %p2, %r31, %r7;
    @%p2 bra    BB0_3;

    .loc 2 6 54
    add.s32     %r30, %r8, %r30;
    ld.param.u32    %r27, [_Z6kernelPPiS0_ii_param_3];
    .loc 2 6 1
    setp.lt.s32     %p3, %r30, %r27;
    @%p3 bra    BB0_2;

BB0_5:
    .loc 2 9 2
    ret;
}

BB0_3 是由编译器自动发出的大小为 memcpy 的字节。从性能角度来看,使用它可能不是一个好主意,但它是完全支持的(并且在所有架构上已经有很长时间了)。

The code block at BB0_3 is a byte sized memcpy loop emitted automagically by the compiler. It might not be a great idea from a performance point-of-view to use it, but it is fully supported (and has been for a long time on all architectures).

四年后编辑补充说,由于设备端运行时API是作为CUDA 6发布周期的一部分发布的,因此也可以直接调用

Edited four years later to add that since the device side runtime API was released as part of the CUDA 6 release cycle, it is also possible to directly call something like

cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice)

在所有支持它的架构(Compute Capability 3.5和更新的硬件)的设备代码中。

in device code for all architectures which support it (Compute Capability 3.5 and newer hardware).

这篇关于有一个等价于memcpy()在CUDA内核中工作吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆