Strategies for timing CUDA Kernels: Pros and Cons?

Question

When timing CUDA kernels, the following doesn't work because the kernel doesn't block the CPU program execution while it executes:

start timer
kernel<<<g,b>>>();
end timer
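
The reason is that a kernel launch is asynchronous with respect to the host: control returns to the CPU as soon as the launch has been queued. A minimal sketch of the effect follows. The busy-wait kernel `spin` and the helper `now_ms` are hypothetical and not part of the original question; they only make the difference visible.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly `cycles` GPU clock ticks.
__global__ void spin(long long cycles) {
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
}

// Host timer in milliseconds, based on a monotonic clock.
static double now_ms() {
    using namespace std::chrono;
    return duration<double, std::milli>(
               steady_clock::now().time_since_epoch()).count();
}

int main() {
    spin<<<1, 1>>>(1000);             // warm-up launch, absorbs one-time setup cost
    cudaDeviceSynchronize();

    double t0 = now_ms();
    spin<<<1, 1>>>(100000000LL);      // returns once the launch is queued
    double t_launch = now_ms() - t0;  // covers launch overhead only
    cudaDeviceSynchronize();          // block until the kernel has finished
    double t_total = now_ms() - t0;   // now includes kernel execution

    printf("without sync: %.3f ms\nwith sync:    %.3f ms\n", t_launch, t_total);
    return 0;
}
```

On a typical setup the first number should be far smaller than the second, which is why each working strategy below waits for the GPU before reading the end time.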






I've seen three basic ways of (successfully) timing CUDA kernels:

(1) Two CUDA eventRecords.

float responseTime; // result will be in milliseconds
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);       // mark the start in the default stream
cudaEventSynchronize(start);
kernel<<<g,b>>>();
cudaEventRecord(stop);        // mark the end, queued after the kernel
cudaEventSynchronize(stop);   // block the CPU until the stop event has completed
cudaEventElapsedTime(&responseTime, start, stop); // responseTime = elapsed time

(2) One CUDA eventRecord.

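double start = read_timer(); // helper function on CPU, in milliseconds; kept as double because float cannot hold epoch milliseconds precisely
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = (float)(read_timer() - start); // subtract in double, then narrow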

(3) deviceSynchronize instead of eventRecord. (Probably only useful when programming with a single stream.)

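double start = read_timer(); // helper function on CPU, in milliseconds; kept as double because float cannot hold epoch milliseconds precisely
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = (float)(read_timer() - start); // subtract in double, then narrow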

I experimentally verified that these three strategies produce the same timing result.

Questions:

  • What are the tradeoffs of these strategies? Any hidden details here?
  • Aside from timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?

You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:

double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec))*1000; //milliseconds
}


Recommended answer

You seem to have ruled out most of the differences by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and "Aside from timing (complex sequences) ..." where the first case is clearly better.
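
To illustrate the multi-stream case, here is a rough sketch in which the events are recorded into the same stream as the kernel being timed, so cudaEventSynchronize waits only for that stream. A host timer combined with cudaDeviceSynchronize would instead wait for work in every stream. The kernel `scale`, the buffers, and the stream names are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to give the GPU something to do.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256, blocks = (n + threads - 1) / threads;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scale<<<blocks, threads, 0, s2>>>(b, n); // unrelated work in stream s2
    cudaEventRecord(start, s1);              // events are queued in s1 only
    scale<<<blocks, threads, 0, s1>>>(a, n); // the kernel being timed
    cudaEventRecord(stop, s1);
    cudaEventSynchronize(stop);              // waits for s1 up to `stop`, not for s2

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel in s1: %.3f ms\n", ms);

    cudaDeviceSynchronize();                 // let s2 finish before cleanup
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b);
    return 0;
}
```

The measured interval can still be affected by work running concurrently in s2, so treat the number as the time between the two events rather than as an isolated kernel cost.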

One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
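
One way such a "portable" read_timer might look is sketched here with std::chrono from C++11. This is an assumption about what the answer has in mind, not code from the original thread; it is plain host code, so it should also compile unchanged inside a .cu file.

```cpp
#include <chrono>

// Portable stand-in for read_timer(): milliseconds from a monotonic clock.
// The absolute value is meaningless; only the difference between two calls is.
double read_timer() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               clock::now().time_since_epoch()).count();
}
```

Unlike gettimeofday, steady_clock is not affected by wall-clock adjustments, which suits interval measurements. Methods (2) and (3) still need the synchronization call before the second read_timer() call.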
