printf inside CUDA __global__ function


Question

I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know if Ad and Bd are what I think they are, and to see whether that function is actually being called.

Answer

EDIT

To avoid misleading people: as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.

END EDIT
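
On such a device the kernel can call printf directly. A minimal sketch, assuming compilation for compute capability 2.0 or higher (e.g. nvcc -arch=sm_20); the kernel name and launch configuration below are illustrative only, not part of the question:

#include <stdio.h>

// Illustrative kernel: each thread prints its index and the value it reads.
__global__ void debugPrintKernel(const float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        printf("thread %d: data = %f\n", idx, data[idx]);   // device-side printf, cc >= 2.0
}

// Host side (output is flushed when the device synchronises):
// debugPrintKernel<<<(n + 255) / 256, 256>>>(d_data, n);
// cudaDeviceSynchronize();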

You have a choice of:


  • Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows

  • Use cuprintf, which is available for registered developers (sign up here: http://nvdeveloper.nvidia.com/content/GPUComputingDeveloperApplication/frmDeveloperRegistration.asp)

  • Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise); a sketch of this follows the list
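
For the third option, here is a sketch of the copy-back approach applied to a kernel like yours. The debug_buf argument and the Matrix layout (width, height, flat elements array) are assumptions made for illustration:

struct Matrix { int width; int height; float *elements; };   // assumed layout

// Same multiplication as in the question, but each thread also stashes the
// value you want to inspect into a separate device buffer.
__global__ void MatrixMulKernelDebug(Matrix Ad, Matrix Bd, Matrix Xd, float *debug_buf)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float sum = 0;
    for (int k = 0; k < Ad.width; ++k)
        sum += Ad.elements[ty * Ad.width + k] * Bd.elements[k * Bd.width + tx];

    Xd.elements[ty * Xd.width + tx] = sum;
    debug_buf[ty * Xd.width + tx] = sum;     // value to dump on the host
}

// Host side:
// float *d_debug, *h_debug = (float *)malloc(n * sizeof(float));
// cudaMalloc((void **)&d_debug, n * sizeof(float));
// MatrixMulKernelDebug<<<grid, block>>>(Ad, Bd, Xd, d_debug);
// cudaDeviceSynchronize();                                      // remember to synchronise
// cudaMemcpy(h_debug, d_debug, n * sizeof(float), cudaMemcpyDeviceToHost);
// for (int i = 0; i < n; ++i) printf("%d: %f\n", i, h_debug[i]);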

Regarding your code snippet:


  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer). Right now you will have no problem, but if the function signature gets very large you may hit the 256-byte limit.
  • You have inefficient reads from Ad: there will be a 32-byte memory transaction for each read into Melement. Consider using shared memory as a staging area (cf. the transposeNew sample in the SDK). A combined sketch of both points follows this list.
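
A combined sketch of both suggestions, assuming the same Matrix layout as above, matrix dimensions that are multiples of the tile size, and block dimensions equal to TILE x TILE; the kernel name and TILE value are hypothetical:

#define TILE 16                                               // assumed to equal blockDim.x == blockDim.y

struct Matrix { int width; int height; float *elements; };    // assumed layout, as above

// The kernel receives device pointers to Matrix structs instead of by-value
// copies, and stages tiles of Ad and Bd in shared memory before the inner loop.
__global__ void MatrixMulKernelTiled(const Matrix *Ad, const Matrix *Bd, Matrix *Xd)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float sum = 0.0f;
    for (int t = 0; t < Ad->width / TILE; ++t) {
        As[ty][tx] = Ad->elements[row * Ad->width + t * TILE + tx];
        Bs[ty][tx] = Bd->elements[(t * TILE + ty) * Bd->width + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    Xd->elements[row * Xd->width + col] = sum;
}

// Host side: copy each struct (whose elements pointer already refers to device
// memory) to the device, then pass the device pointer to the kernel.
// Matrix hA = { width, height, d_elementsA };          // d_elementsA from cudaMalloc
// Matrix *dA;  cudaMalloc((void **)&dA, sizeof(Matrix));
// cudaMemcpy(dA, &hA, sizeof(Matrix), cudaMemcpyHostToDevice);
// ...same for dB and dX...
// dim3 block(TILE, TILE), grid(width / TILE, width / TILE);
// MatrixMulKernelTiled<<<grid, block>>>(dA, dB, dX);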

