Strategies for timing CUDA Kernels: Pros and Cons?


Question

When timing CUDA kernels, the following doesn't work because the kernel doesn't block the CPU program execution while it executes:

start timer
kernel<<<g,b>>>();
end timer

I've seen three basic ways of (successfully) timing CUDA kernels:

(1) Two CUDA eventRecords.

float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time

(2) One CUDA eventRecord.

double start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
double responseTime = read_timer() - start; //double: a float cannot resolve epoch-scale millisecond values

(3) cudaDeviceSynchronize() instead of an event record. (Probably only useful when all work is issued to a single stream.)

double start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
double responseTime = read_timer() - start; //double: a float cannot resolve epoch-scale millisecond values

I experimentally verified that these three strategies produce the same timing result.

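Questions:

  • What are the tradeoffs of these strategies? Are there any hidden details here?
  • Aside from timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?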

You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:

double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec))*1000; //milliseconds
}

Recommended answer

You seem to have ruled out most of the differences by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and "Aside from timing (complex sequences) ..." where the first case is clearly better.

One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented, since it relies on gettimeofday() from <sys/time.h>. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
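
As a rough illustration of what a "portable" read_timer might look like, here is a minimal sketch using std::chrono from the C++11 standard library. It is not from the original question. It assumes a C++11-capable host compiler, and it reports milliseconds since its own first call rather than since the epoch, which is enough when only the difference between two readings is used.

```cpp
#include <chrono>

// Hypothetical portable replacement for the question's read_timer().
// Returns the milliseconds elapsed since the first call, as a double.
// steady_clock is monotonic, so wall-clock adjustments do not affect it.
double read_timer() {
    using clock = std::chrono::steady_clock;
    // Fixed reference point, initialized once on the first call.
    static const clock::time_point origin = clock::now();
    std::chrono::duration<double, std::milli> elapsed = clock::now() - origin;
    return elapsed.count();
}
```

It would be used the same way as in methods (2) and (3): read the timer before the launch, synchronize, read it again, and subtract. Like any host-side timer, the measured interval includes launch and synchronization overhead, not just kernel execution.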
