Is my CUDA kernel really running on the device, or is it being mistakenly executed by the host in emulation?

Question

I just got my GPU-enabled video card and started playing with CUDA. Just to get my head straight with blocks and threads, I wrote a simple kernel that stores its identifier into device memory, which I later copy back to the host and print. But then I thought, why not simply use printf inside the kernel function? I tried that even though I believed it was impossible. Here is what my attempt looked like:

__global__ void
printThreadXInfo (int *data)
{
    // Store this thread's x-index in device memory and also print it from the device.
    int i = threadIdx.x;
    data[i] = i;
    printf ("%d\n", i);
}

... but all of a sudden I saw the output in the console. Then I searched the developer's manual and found printf mentioned in the section about device emulation. It said that device emulation provides the benefit of running host-specific code in the kernel, such as calling printf.

I don't really need to call printf. But now I am a little bit confused. I have two assumptions. The first is that NVIDIA's developers implemented some device-specific printf that somehow, transparently to the developer, reaches back into the calling process, executes the standard printf function, and takes care of the memory copying, etc. That sounds a bit crazy. The other assumption is that the code I compiled somehow runs in emulation rather than on a real device. But that doesn't sound right either, because I measured the performance of adding two numbers over a 1-million-element array, and the CUDA kernel manages to do it about 200× faster than I can on a CPU. Or maybe it runs in emulation only when it detects some host-specific code? If that is true, why wasn't I issued a warning?

Please help me sort this out. I am using an NVIDIA GeForce GTX 560 Ti on Linux (Intel Xeon, 1 CPU with 4 physical cores, 8 GB of RAM, if that matters). Here is my nvcc version:

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:45_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
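(Editor's aside, not from the original post: device-side printf requires compute capability 2.0 or higher, and a GTX 560 Ti is a 2.1 device. A minimal sketch that confirms at runtime which physical device the code targets, using the standard cudaGetDeviceProperties call:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main ()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties (&prop, 0);   // query device 0
    // Device-side printf needs compute capability >= 2.0 (Fermi);
    // a GTX 560 Ti should report 2.1 here.
    printf ("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```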

And here is how I compile my code:

/usr/local/cuda/bin/nvcc -gencode=arch=compute_20,code=\"sm_21,compute_20\" -m64 --compiler-options -fno-strict-aliasing -isystem /opt/boost_1_46_1/include -isystem /usr/local/cuda/include -I../include --compiler-bindir "/usr/local/cuda/bin" -O3 -DNDEBUG -o build_linux_release/ThreadIdxTest.cu.o -c ThreadIdxTest.cu

/usr/local/cuda/bin/nvcc -gencode=arch=compute_20,code=\"sm_21,compute_20\" -m64 --compiler-options -fno-strict-aliasing -isystem /opt/boost_1_46_1/include -isystem /usr/local/cuda/include -I../include --compiler-bindir "/usr/local/cuda/bin" -O3 -DNDEBUG --generate-dependencies ThreadIdxTest.cu | sed -e "s;ThreadIdxTest.o;build_linux_release/ThreadIdxTest.cu.o;g" > build_linux_release/ThreadIdxTest.d

g++ -pipe -m64 -ftemplate-depth-1024 -fno-strict-aliasing -fPIC -pthread -DNDEBUG -fomit-frame-pointer -momit-leaf-frame-pointer -fno-tree-pre -falign-loops -Wuninitialized -Wstrict-aliasing -ftree-vectorize -ftree-loop-linear -funroll-loops -fsched-interblock -march=native -mtune=native -g0 -O3 -ffor-scope -fuse-cxa-atexit -fvisibility-inlines-hidden -Wall -Wextra -Wreorder -Wcast-align -Winit-self -Wmissing-braces -Wmissing-include-dirs -Wswitch-enum -Wunused-parameter -Wredundant-decls -Wreturn-type -isystem /opt/boost_1_46_1/include -isystem /usr/local/cuda/include -I../include -L/opt/boost_1_46_1/lib -L/usr/local/cuda/lib64 -lcudart -lgtest -lgtest_main build_linux_release/ThreadIdxTest.cu.o ../src/build_linux_release/libspartan.a -o build_linux_release/ThreadIdxTest

... and by the way, both the host code and the kernel code are mixed in one source file with a .cu extension (maybe I am not supposed to do that, but I saw this style in the SDK examples).

Your help is highly appreciated. Thank you!

Answer

As of CUDA 3.1, there is no longer any device emulation. printf is now supported directly in kernels, on devices of compute capability 2.0 and higher (which includes your GTX 560 Ti). Your code really does run on the GPU.
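A minimal self-contained sketch illustrating this (thread count and names are hypothetical, not from the original post). One practical detail: the device-side printf output buffer is only flushed at certain points, so a cudaDeviceSynchronize() after the launch ensures the lines actually appear before the program exits:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void printThreadXInfo (int *data)
{
    int i = threadIdx.x;
    data[i] = i;
    printf ("%d\n", i);          // device-side printf, needs sm_20 or later
}

int main ()
{
    const int n = 8;             // hypothetical thread count
    int *d_data = NULL;
    cudaMalloc (&d_data, n * sizeof (int));

    printThreadXInfo<<<1, n>>> (d_data);

    // The device printf buffer is flushed at synchronization points;
    // without this, the program may exit before anything is printed.
    cudaDeviceSynchronize ();

    cudaFree (d_data);
    return 0;
}
```

Compiled with a compute-capability-2.x target (e.g. -gencode=arch=compute_20,code=sm_21, as in your build lines), this printf executes on the GPU itself; no emulation is involved in CUDA 4.0.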
