Timing different sections in CUDA kernel


Question

I have a CUDA kernel that calls out to a series of device functions.

What is the best way to get the execution time for each of the device functions?

What is the best way to get the execution time for a section of code in one of the device functions?

Recommended answer

In my own code, I use the clock() function to get precise timings. For convenience, I have the macros

enum {
    tid_this = 0,
    tid_that,
    tid_count
    };
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and add the elapsed ticks to the "tid_this" counter. */
    TIMER_TOC( tid_this );

    }

You can then read the cuda_timers in the host code.
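As a minimal sketch of that read-back step, the program below copies cuda_timers to the host with cudaMemcpyFromSymbol. The enum, the array, and the macros are repeated from above so the example is self-contained; the kernel body, the helper name run_and_read_timers, and the launch sizes are assumptions made up for illustration. Because each macro declares its own tic or toc variable, the sketch wraps each timed section in its own braces so that two sections can live in one kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define USETIMERS

enum {
    tid_this = 0,
    tid_that,
    tid_count
    };
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

/* Illustrative kernel (not from the original answer) with two timed sections. */
__global__ void mykernel ( float *out , int n ) {
    float acc = 0.0f;

    {   /* Section 1, accumulated into tid_this. */
        TIMER_TIC
        for ( int i = 0 ; i < n ; i++ ) acc += sinf( 0.001f * i + threadIdx.x );
        TIMER_TOC( tid_this );
    }

    {   /* Section 2, accumulated into tid_that. */
        TIMER_TIC
        for ( int i = 0 ; i < n ; i++ ) acc += cosf( 0.001f * i + threadIdx.x );
        TIMER_TOC( tid_that );
    }

    out[ blockIdx.x * blockDim.x + threadIdx.x ] = acc;
    }

/* Hypothetical helper: reset the timers, run the kernel, copy the timers back. */
bool run_and_read_timers ( float *timers_out , int blocks ) {
    const int threads = 128;
    float *d_out = NULL;
    float zeros[ tid_count ] = { 0.0f };

    if ( cudaMalloc( &d_out , blocks * threads * sizeof(float) ) != cudaSuccess ) return false;

    /* Clear the device-side counters before the launch. */
    bool ok = ( cudaMemcpyToSymbol( cuda_timers , zeros , sizeof(zeros) ) == cudaSuccess );

    if ( ok ) {
        mykernel<<< blocks , threads >>>( d_out , 1000 );
        ok = ( cudaDeviceSynchronize() == cudaSuccess );
    }

    /* Copy the accumulated tick counts from the __device__ array to the host. */
    if ( ok ) ok = ( cudaMemcpyFromSymbol( timers_out , cuda_timers , sizeof(zeros) ) == cudaSuccess );

    cudaFree( d_out );
    return ok;
    }

int main ( void ) {
    float timers[ tid_count ];
    if ( !run_and_read_timers( timers , 100 ) ) {
        fprintf( stderr , "CUDA error: %s\n" , cudaGetErrorString( cudaGetLastError() ) );
        return 1;
    }
    printf( "tid_this: %.0f ticks, tid_that: %.0f ticks (summed over 100 blocks)\n" ,
            timers[ tid_this ] , timers[ tid_that ] );
    return 0;
    }
```

The counters are never reset on the device, so the sketch zeroes them with cudaMemcpyToSymbol before each launch; without that step the values would keep accumulating across launches.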

A few things to note:


  • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
  • Having said that, the timer assumes that the zeroth thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
  • The timers count the number of clock ticks. To get the number of milliseconds, divide this by the clock rate of your device in kHz (equivalently, divide by the number of GHz to get nanoseconds, then divide by 10^6).
  • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
  • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
  • The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
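The tick-to-milliseconds conversion and the wrap-around handling from the notes above can be sketched on the host as follows. The helper names ticks_to_ms and elapsed_ticks are hypothetical, and the tick total in main is a made-up example value, not a measurement. The sketch assumes the counter ticks at the peak rate reported by cudaDevAttrClockRate (in kHz), which may not hold if the GPU boosts or throttles. Note that elapsed_ticks uses unsigned 32-bit subtraction instead of the ternary in the macro; that is a swapped-in technique, and its result is one tick larger than the ternary's in the wrapped case.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

/* Hypothetical helper: a clock rate in kHz is the number of ticks per
   millisecond, so dividing ticks by it gives milliseconds. */
double ticks_to_ms ( double ticks , int clock_khz ) {
    return ticks / (double)clock_khz;
    }

/* Hypothetical helper: elapsed ticks on a 32-bit counter. Unsigned
   subtraction is modular, so this stays correct across a single wrap. */
uint32_t elapsed_ticks ( uint32_t tic , uint32_t toc ) {
    return toc - tic;
    }

int main ( void ) {
    int device = 0 , clock_khz = 0;

    /* Query the peak clock rate of the current device, in kHz. */
    if ( cudaGetDevice( &device ) != cudaSuccess ||
         cudaDeviceGetAttribute( &clock_khz , cudaDevAttrClockRate , device ) != cudaSuccess ) {
        fprintf( stderr , "could not query the device clock rate\n" );
        return 1;
    }

    /* Example value only: a tick total as it might be read from cuda_timers. */
    const double summed_ticks = 3.0e8;
    const int blocks = 100;

    /* The timers sum over all blocks, so divide by the block count
       to get an average per-block time. */
    printf( "total: %g ms, average per block: %g ms\n" ,
            ticks_to_ms( summed_ticks , clock_khz ) ,
            ticks_to_ms( summed_ticks / blocks , clock_khz ) );
    return 0;
    }
```

Since the stored value is a sum over blocks that largely run concurrently, the total is not wall-clock time; the per-block average is usually the more meaningful figure when comparing sections.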

P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.
