Timing CUDA kernels


Question

Hi everyone, I'm currently working on timing some of my CUDA code. I was able to time the kernels using events, and my kernel ran for 19 ms. I find this doubtful, because a sequential implementation of the same computation took around 5000 ms. I know the code should run faster, but should it be this fast?

I'm using wrapper functions to call the CUDA kernels from my .cpp program. Am I supposed to be calling them there or in the .cu file? Thanks!

Recommended answer

The obvious way to check whether your program is working is to compare its output to that of your CPU-based implementation. If you get the same output, it is working by definition, right? :)
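A minimal sketch of such a comparison, assuming a stand-in computation: `scale_kernel`, `scale_cpu`, and `gpu_matches_cpu` are hypothetical names for illustration, not the asker's code. It runs the same work on the CPU and the GPU and compares the results element by element within a tolerance.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one: doubles each element.
__global__ void scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Sequential reference for the same computation.
void scale_cpu(const std::vector<float>& in, std::vector<float>& out) {
    for (size_t i = 0; i < in.size(); ++i) out[i] = 2.0f * in[i];
}

// Returns true when the GPU result matches the CPU reference within tol.
// Expects n > 0.
bool gpu_matches_cpu(int n, float tol) {
    std::vector<float> in(n), ref(n), got(n);
    for (int i = 0; i < n; ++i) in[i] = 0.5f * i;
    scale_cpu(in, ref);

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, n);

    // A kernel that fails to launch returns almost immediately, which can
    // look like an implausibly fast run, so check for errors explicitly.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(got.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }

    for (int i = 0; i < n; ++i)
        if (std::fabs(got[i] - ref[i]) > tol) return false;
    return true;
}
```

Calling `gpu_matches_cpu(1000, 1e-6f)` from `main` would be one way to use it. For real floating-point kernels a looser or relative tolerance is usually needed, since the GPU and CPU may round differently.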

If your program is experimental in such a way that it doesn't really produce any verifiable output, there is a good chance that the compiler has optimized out some (or all) of your code. The compiler removes code that does not contribute to output data. For instance, the entire contents of a kernel can be removed if the final statement that stores the calculated value is commented out.
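The effect can be illustrated with two nearly identical kernels. This is only a sketch: `with_store`, `without_store`, and `count_written` are hypothetical names, and whether the loop is actually removed depends on the compiler and optimization settings.

```cuda
#include <vector>
#include <cuda_runtime.h>

// The final store makes the loop's result observable, so the work must be done.
__global__ void with_store(float* out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < iters; ++k) acc = 0.5f * acc + 1.0f;
    out[i] = acc;
}

// Same loop with the store commented out: nothing observable depends on acc,
// so the compiler is free to delete the loop and leave a nearly empty kernel.
__global__ void without_store(float* out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < iters; ++k) acc = 0.5f * acc + 1.0f;
    // out[i] = acc;
}

// Launches one of the kernels on a zeroed buffer and counts how many
// elements were written. Returns -1 on any CUDA error. Expects n > 0.
int count_written(bool store, int n, int iters) {
    float* d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (store) with_store<<<blocks, threads>>>(d_out, n, iters);
    else       without_store<<<blocks, threads>>>(d_out, n, iters);

    std::vector<float> host(n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int written = 0;
    for (float v : host)
        if (v != 0.0f) ++written;
    return written;
}
```

Timing `without_store` says little about the cost of the loop. To see what the compiler actually kept, one option is to inspect the generated code, for example the PTX from `nvcc -ptx` or the SASS from `cuobjdump -sass`.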

As to your speedup: 5000 ms / 19 ms ≈ 263x, which is an unlikely increase, even for algorithms that map perfectly to the GPU architecture.
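Before trusting a figure like 19 ms, it may help to rule out a measurement problem. The following is a sketch of event-based timing under stated assumptions: `scale_in_place` and `time_kernel_ms` are hypothetical names, and the kernel is a placeholder for the real one. It does a warm-up launch, records events around the timed launch, waits on the stop event, and checks for launch errors.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: doubles each element in place.
__global__ void scale_in_place(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the kernel time in milliseconds measured with CUDA events,
// or -1.0f on any CUDA error. Expects n > 0.
float time_kernel_ms(int n) {
    float* d = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not part of the measurement.
    scale_in_place<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale_in_place<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);

    // Kernel launches are asynchronous: wait until the stop event has
    // actually been reached on the device before reading the elapsed time.
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Events placed like this measure only the kernel itself. If the 5000 ms sequential figure covers the whole computation, a fair comparison would also include the host-to-device and device-to-host copies on the GPU side.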
