How to convert CUDA clock cycles to milliseconds?


Question

I'd like to measure the time a bit of code within my kernel takes. I've followed this question along with its comments, so that my kernel looks something like this:

__global__ void kernel(..., long long int *runtime)
{
    long long int start = 0; 
    long long int stop = 0;

    asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));

    /* Some code here */

    asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));

    runtime[threadIdx.x] = stop - start;
    ...
}

The answer says to do a conversion as follows:

The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.

I do this:

for(long i = 0; i < size; i++)
{
  fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
}

Where 1.62 is the GPU Max Clock rate of my device. But the time I get in milliseconds does not look right, because it suggests that each thread took minutes to complete. This cannot be correct, as execution finishes in less than a second of wall-clock time. Is the conversion formula incorrect, or am I making a mistake somewhere? Thanks.

Answer

The correct conversion in your case is not GHz:

fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
                                                             ^^^^

but hertz:

fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1620000000.0f)*1000.0);
                                                             ^^^^^^^^^^^^^

In dimensional analysis:

                  clock cycles
clock cycles  /  -------------- = seconds
                   second
                    

The first term is the clock cycle measurement. The second term is the frequency of the GPU (in hertz, not GHz), and the third term is the desired measurement (seconds). You can convert to milliseconds by multiplying seconds by 1000.
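As a minimal sketch of that conversion (the helper name cycles_to_ms and the explicit 1.62e9 Hz value are illustrative assumptions, not part of the original code):

// Illustrative helper, assuming the GPU clock rate is supplied in Hz
// (e.g. a 1.62 GHz device -> 1.62e9 Hz). Not part of the original question or answer.
double cycles_to_ms(long long cycles, double clock_hz)
{
    double seconds = (double)cycles / clock_hz;  // cycles / (cycles per second) = seconds
    return seconds * 1000.0;                     // seconds -> milliseconds
}

// Hypothetical usage in the loop above:
//   fprintf(stdout, "%ld:%lld=%f(ms)\n", i, runtime[i], cycles_to_ms(runtime[i], 1.62e9));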

Here's a worked example that shows a device-independent way to do it (so you don't have to hard-code the clock frequency):

$ cat t1306.cu
#include <stdio.h>

const long long delay_time = 1000000000;
const int nthr = 1;
const int nTPB = 256;

__global__ void kernel(long long *clocks){

  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  long long start=clock64();
  while (clock64() < start+delay_time);
  if (idx < nthr) clocks[idx] = clock64()-start;
}

int main(){

  int peak_clk = 1;
  int device = 0;
  long long *clock_data;
  long long *host_data;
  host_data = (long long *)malloc(nthr*sizeof(long long));
  cudaError_t err = cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, device);
  if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
  err = cudaMalloc(&clock_data, nthr*sizeof(long long));
  if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
  kernel<<<(nthr+nTPB-1)/nTPB, nTPB>>>(clock_data);
  err = cudaMemcpy(host_data, clock_data, nthr*sizeof(long long), cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
  printf("delay clock cycles: %ld, measured clock cycles: %ld, peak clock rate: %dkHz, elapsed time: %fms
", delay_time, host_data[0], peak_clk, host_data[0]/(float)peak_clk);
  return 0;
}
$ nvcc -arch=sm_35 -o t1306 t1306.cu
$ ./t1306
delay clock cycles: 1000000000, measured clock cycles: 1000000210, peak clock rate: 732000kHz, elapsed time: 1366.120483ms
$

This uses cudaDeviceGetAttribute to get the clock rate, which is returned in kHz, allowing us to easily compute milliseconds in this case.
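To spell out the unit handling: a rate in kHz is cycles per millisecond, so dividing the measured cycle count by the kHz value yields milliseconds directly. A minimal sketch under that assumption (the helper name is hypothetical, not part of the example above):

// Illustrative only: cudaDevAttrClockRate reports the peak clock in kHz,
// i.e. clock cycles per millisecond, so cycles / kHz = milliseconds.
float cycles_to_ms_from_khz(long long cycles, int peak_clk_khz)
{
    return cycles / (float)peak_clk_khz;
}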

In my experience, the above method works generally well on datacenter GPUs, which run their clock at the reported rate (this may be affected by settings you make in nvidia-smi). Other GPUs, such as GeForce GPUs, may be running at (unpredictable) boost clocks that will make this method inaccurate.

Also, more recently, CUDA has the ability to preempt activity on the GPU. This can come about in a variety of circumstances, such as debugging, CUDA dynamic parallelism, and other situations. If preemption occurs for whatever reason, attempting to measure anything based on clock64() is generally not reliable.
