我怎么能叫从CUDA C一PTX功能? [英] How can I call a ptx function from CUDA C?

查看:484
本文介绍了我怎么能叫从CUDA C一PTX功能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图找到一种方法来调用CUDA C的PTX功能(.FUNC)
说我有一个PTX的功能是这样的:

I am trying to find a way to call a ptx function (.func) from CUDA C. Say I had a ptx function like this:

.func (.reg .s32 %res) inc_ptr ( .reg .s32 %ptr, .reg .s32 %inc )
{
    add.s32 %res, %ptr, %inc;
    ret;
}

我知道我可以从PTX调用它像这样:

I know I can call it from ptx like so:

call (%d), inc_ptr, (%s, %d);

但我不知道如何将它从CUDA C.调用
我知道我可以内联ptx汇编使用ASM(),但我还没有找到一种方法,内联函数。
希望有人能帮助!

But I have no idea how to call it from CUDA C. I know I can inline ptx assembly with asm(), but I haven't found a way to inline a function. Hope someone can help!

谢谢!

推荐答案

这可以通过使用CUDA 5.0引入了单独的编译设施来完成。我不相信有一种方法来之前,CUDA 5.0或PTX修订为此在整体程序编译模式或工具包3.1之前的版本。

This can be done using the separate compilation facilities introduced with CUDA 5.0. I don't believe there is a way to do this in "whole" program compilation mode or in toolkit versions prior to CUDA 5.0 or in PTX revisions prior to 3.1.

有可能比较容易说明如何用工作实例做到这一点。让我们先从一个简单的PTX函数指针递增,类似你的例子:

It is probably easiest to illustrate how to do this with a worked example. Let's start with a simple PTX function for incrementing pointers, similar to your example:

.version 3.1
.target sm_30
.address_size 32
.visible .func inc_ptr(.param .b32 ptr, .param .b32 inc)
{
    .reg .s32   %r<6>;
    ld.param.u32 %r1, [ptr];
    ld.param.u32 %r2, [inc];
    ld.u32 %r3, [%r1];
    ld.u32 %r4, [%r3];
    add.s32 %r5, %r4, %r2;
    st.u32  [%r3], %r5;
    ret;
}

这可以用 ptxas 被编译为一个重新定位的设备对象,然后装入一个容器fatbinary文件。后者步骤似乎是至关重要的。默认 ptxas 输出仅仅是一个重新定位的精灵对象,也没有产生fatbinary容器。如此看来,设备code联动阶段中运行NVCC(至少在CUDA 5)期待所有的设备code是fatbinary容器present。联动否则会失败。结果是这样的:

This can be compiled to a relocatable device object using ptxas and then packed into a fatbinary container file. The latter step seems to be critical. The default ptxas output is only a relocatable elf object, there is no fatbinary container produced. It seems that the device code linkage phase that nvcc runs (at least in CUDA 5) is expecting all device code is present in fatbinary containers. The linkage will fail otherwise. The result looks like this:

$ ptxas -arch=sm_30 -c -o inc_ptr.gpu.o inc_ptr.ptx
$ fatbinary -arch=sm_30 -create inc_ptr.fatbin -elf inc_ptr.gpu.o 
$ cuobjdump -sass inc_ptr.fatbin 

Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = <unknown>
host = mac
compile_size = 32bit

    code for sm_30
        Function : inc_ptr
    /*0008*/     /*0x0040dc8580000000*/     LD R3, [R4];
    /*0010*/     /*0x00301c8580000000*/     LD R0, [R3];
    /*0018*/     /*0x14001c0348000000*/     IADD R0, R0, R5;
    /*0020*/     /*0x00301c8590000000*/     ST [R3], R0;
    /*0028*/     /*0x00001de790000000*/     RET;
    /*0030*/     /*0x00001de440000000*/     NOP CC.T;
    /*0038*/     /*0x00001de440000000*/     NOP CC.T;
    /*0040*/     /*0xe0001de74003ffff*/     BRA 0x40;
    /*0048*/     /*0x00001de440000000*/     NOP CC.T;
    /*0050*/     /*0x00001de440000000*/     NOP CC.T;
    /*0058*/     /*0x00001de440000000*/     NOP CC.T;
    /*0060*/     /*0x00001de440000000*/     NOP CC.T;
    /*0068*/     /*0x00001de440000000*/     NOP CC.T;
    /*0070*/     /*0x00001de440000000*/     NOP CC.T;
    /*0078*/     /*0x00001de440000000*/     NOP CC.T;
        ........................

您可以看到fatbinary包含组装PTX微code。随着设备功能fatbin prepared,你可以做这样的事情在CUDA C code:

You can see that the fatbinary contains the microcode from the assembled PTX. With the device function fatbin prepared, you can do something like this in CUDA C code:

extern "C" __device__ void inc_ptr(int* &ptr, const int inc);

__global__
void memsetkernel(int *inout, const int val, const int N)
{
    int stride = blockDim.x * gridDim.x;
    int *p = inout;
    inc_ptr(p, threadIdx.x + blockDim.x*blockIdx.x);

    for(; p < inout+N; inc_ptr(p, stride)) *p = val;
}  


int main(void)
{
    const int n=10;
    int *p;
    cudaMalloc((void**)&p, sizeof(int)*size_t(n));
    memsetkernel<<<1,32>>>(p, 5, n);

    return 0;
}

在单独的编译模式下,设备code工具链将尊重的extern 申报和(只要你符号控制重整),设备功能fatbinary可以与其他设备和主机code被连接以产生最终对象

In separate compilation mode, the device code toolchain will respect the extern declaration and (as long as you get symbol mangling under control), the device function fatbinary can be linked with other device and host code to produce a final object:

$ nvcc -arch=sm_30 -Xptxas="-v" -dlink -o memset.out inc_ptr.fatbin memset_kernel.cu 

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z12memsetkernelPiii' for 'sm_30'
ptxas info    : Function properties for _Z12memsetkernelPiii
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 332 bytes cmem[0]

$ cuobjdump -sass memset.out 

Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = <unknown>
host = mac
compile_size = 32bit
identifier = inc_ptr.fatbin memset_kernel.cu 

    code for sm_30
        Function : _Z12memsetkernelPiii
    /*0008*/     /*0x10005de428004001*/     MOV R1, c [0x0] [0x44];
    /*0010*/     /*0x20105d034800c000*/     IADD R1, R1, -0x8;
    /*0018*/     /*0x00019de428004005*/     MOV R6, c [0x0] [0x140];
    /*0020*/     /*0x10101c034800c000*/     IADD R0, R1, 0x4;
    /*0028*/     /*0x8400dc042c000000*/     S2R R3, SR_Tid_X;
    /*0030*/     /*0x90041c0348004000*/     IADD R16, R0, c [0x0] [0x24];
    /*0038*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
    /*0048*/     /*0xd0009de428004000*/     MOV R2, c [0x0] [0x34];
    /*0050*/     /*0x91045d0348004000*/     IADD R17, R16, -c [0x0] [0x24];
    /*0058*/     /*0x40011de428000000*/     MOV R4, R16;
    /*0060*/     /*0xa0015ca320064000*/     IMAD R5, R0, c [0x0] [0x28], R3;
    /*0068*/     /*0x01119c85c8000000*/     STL [R17], R6;
    /*0070*/     /*0xa0209ca350004000*/     IMUL R2, R2, c [0x0] [0x28];
    /*0078*/     /*0x0001000710000000*/     JCAL 0x0;
    /*0088*/     /*0x0110dc85c0000000*/     LDL R3, [R17];
    /*0090*/     /*0x20001de428004005*/     MOV R0, c [0x0] [0x148];
    /*0098*/     /*0x00049c4340004005*/     ISCADD R18, R0, c [0x0] [0x140], 0x2;
    /*00a0*/     /*0x4831dc031b0e0000*/     ISETP.GE.U32.AND P0, pt, R3, R18, pt;
    /*00a8*/     /*0x000001e780000000*/     @P0 EXIT;
    /*00b0*/     /*0x1004dde428004005*/     MOV R19, c [0x0] [0x144];
    /*00b8*/     /*0x0034dc8590000000*/     ST [R3], R19;
    /*00c8*/     /*0x40011de428000000*/     MOV R4, R16;
    /*00d0*/     /*0x08015de428000000*/     MOV R5, R2;
    /*00d8*/     /*0x0001000710000000*/     JCAL 0x0;
    /*00e0*/     /*0x0110dc85c0000000*/     LDL R3, [R17];
    /*00e8*/     /*0x4831dc03188e0000*/     ISETP.LT.U32.AND P0, pt, R3, R18, pt;
    /*00f0*/     /*0x000001e74003ffff*/     @P0 BRA 0xb8;
    /*00f8*/     /*0x00001de780000000*/     EXIT;
    /*0100*/     /*0xe0001de74003ffff*/     BRA 0x100;
    /*0108*/     /*0x00001de440000000*/     NOP CC.T;
    /*0110*/     /*0x00001de440000000*/     NOP CC.T;
    /*0118*/     /*0x00001de440000000*/     NOP CC.T;
    /*0120*/     /*0x00001de440000000*/     NOP CC.T;
    /*0128*/     /*0x00001de440000000*/     NOP CC.T;
    /*0130*/     /*0x00001de440000000*/     NOP CC.T;
    /*0138*/     /*0x00001de440000000*/     NOP CC.T;
        .....................................


        Function : inc_ptr
    /*0008*/     /*0x0040dc8580000000*/     LD R3, [R4];
    /*0010*/     /*0x00301c8580000000*/     LD R0, [R3];
    /*0018*/     /*0x14001c0348000000*/     IADD R0, R0, R5;
    /*0020*/     /*0x00301c8590000000*/     ST [R3], R0;
    /*0028*/     /*0x00001de790000000*/     RET;
    /*0030*/     /*0x00001de440000000*/     NOP CC.T;
    /*0038*/     /*0x00001de440000000*/     NOP CC.T;
    /*0040*/     /*0xe0001de74003ffff*/     BRA 0x40;
    /*0048*/     /*0x00001de440000000*/     NOP CC.T;
    /*0050*/     /*0x00001de440000000*/     NOP CC.T;
    /*0058*/     /*0x00001de440000000*/     NOP CC.T;
    /*0060*/     /*0x00001de440000000*/     NOP CC.T;
    /*0068*/     /*0x00001de440000000*/     NOP CC.T;
    /*0070*/     /*0x00001de440000000*/     NOP CC.T;
    /*0078*/     /*0x00001de440000000*/     NOP CC.T;
        ........................

有可能是它可以与工具链,才能获得这个其他的技巧,但这种方法肯定的作品。

There might be other tricks which can be played with the toolchain to achieve this, but this approach certainly works.

这篇关于我怎么能叫从CUDA C一PTX功能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆