printf inside CUDA __global__ function


Problem Description


I am currently writing a matrix multiplication on a GPU and would like to debug my code. Since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know if Ad and Bd are what I think they are, and see if that function is actually being called.

Solution

EDIT

To avoid misleading people: as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.

END OF EDIT
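On such devices, a minimal sketch of what that looks like (this is an illustrative variant, not the asker's code; the Matrix type is the one from the question):

```cuda
#include <cstdio>

// Hypothetical debug variant of the kernel: on compute capability 2.0+,
// printf can be called directly from device code.
__global__ void MatrixMulKernelDebug(Matrix Ad, Matrix Bd, Matrix Xd){
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Print from a single thread only, otherwise every thread in the
    // grid emits a line and the output becomes unreadable.
    if (tx == 0 && ty == 0) {
        printf("Ad.width = %d, first element = %f\n",
               Ad.width, Ad.elements[0]);
    }
    // ... rest of the kernel as in the question ...
}
```

Note that device-side printf output is buffered; call cudaDeviceSynchronize() on the host after the kernel launch so the buffer is flushed before the program exits.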

You have choices:

  • Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
  • Use cuprintf, which is available for registered developers (sign up here)
  • Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise)
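The third option can be sketched as follows (a hypothetical example, not the asker's code: the extra `debug` parameter and buffer sizes are assumptions):

```cuda
// Hypothetical sketch: give the kernel an extra device buffer and have
// each thread record an intermediate value into it.
__global__ void MatrixMulKernelDbg(Matrix Ad, Matrix Bd, Matrix Xd,
                                   float* debug){
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Record the first value each thread reads from Ad.
    debug[ty * Ad.width + tx] = Ad.elements[ty * Ad.width + tx];

    // ... rest of the kernel as in the question ...
}

// Host side (sketch; error checking omitted):
//   float *d_debug, h_debug[N];
//   cudaMalloc(&d_debug, N * sizeof(float));
//   MatrixMulKernelDbg<<<grid, block>>>(Ad, Bd, Xd, d_debug);
//   cudaDeviceSynchronize();   // wait for the kernel to finish
//   cudaMemcpy(h_debug, d_debug, N * sizeof(float),
//              cudaMemcpyDeviceToHost);
```

The cudaDeviceSynchronize() (or any blocking call such as the cudaMemcpy itself) matters: kernel launches are asynchronous, so dumping the buffer before the kernel completes would show stale data.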

Regarding your code snippet:

  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer). Right now you will have no problem, but if the function signature gets very large then you may hit the 256-byte limit on kernel arguments
  • Your reads from Ad are inefficient: you will have a 32-byte memory transaction for each read into Melement - consider using shared memory as a staging area (cf. the transposeNew sample in the SDK)
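The first suggestion can be sketched like this (an illustrative example, not the answerer's code; grid/block configuration and error checking are omitted):

```cuda
// Hypothetical sketch: pass the Matrix structs by device pointer so the
// kernel signature stays small regardless of the struct size.
__global__ void MatrixMulKernelPtr(const Matrix* Ad, const Matrix* Bd,
                                   Matrix* Xd){
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float sum = 0;
    for (int k = 0; k < Ad->width; ++k)
        sum += Ad->elements[ty * Ad->width + k]
             * Bd->elements[k * Bd->width + tx];

    Xd->elements[ty * Xd->width + tx] = sum;
}

// Host side (sketch): copy each struct to the device first. Note that
// the elements pointers inside Ad/Bd/Xd must already point to device
// memory before the struct itself is copied over.
//   Matrix *dAd;
//   cudaMalloc(&dAd, sizeof(Matrix));
//   cudaMemcpy(dAd, &Ad, sizeof(Matrix), cudaMemcpyHostToDevice);
//   ... repeat for dBd and dXd, then launch:
//   MatrixMulKernelPtr<<<grid, block>>>(dAd, dBd, dXd);
```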
