Different timing indicated from two kinds of timers


Problem description



I'm trying to use two kinds of timers to measure the run time of a GPU kernel. As shown in the code below, I use cudaEventRecord to measure the overall kernel, and inside the kernel I call clock(). However, the output of the code shows that the two timers give different measurements:

gpu freq = 1530000 khz

Hello from block 0, thread 0

kernel runtime: 0.0002453 seconds

kernel cycle: 68194

According to the results, the kernel took 68194 clock cycles, so the corresponding time should be 68194/1530000000 = 0.00004457124 seconds. But cudaEventRecord reported 0.0002453 seconds. Could anyone explain why? Thank you.
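For reference, prop.clockRate is reported in kHz, so the conversion works out like this (a small worked sketch using the numbers from the output above):

// 68194 cycles at the reported clock rate of 1530000 kHz (prop.clockRate is in kHz)
double cycles   = 68194.0;
double clock_hz = 1530000.0 * 1000.0;   // kHz -> Hz, i.e. 1.53 GHz
double seconds  = cycles / clock_hz;    // ~4.457e-5 s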

============================

#include <iostream>
#include <stdio.h>
#include <math.h>
__global__ void add(int *runtime)
{
  clock_t start_time = clock();
  printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
  clock_t end_time = clock();
  *runtime = (int)(end_time - start_time);
}

int main(void)
{
  int *runtime;

  cudaDeviceProp prop;
  int result = cudaGetDeviceProperties(&prop, 0);
  printf("gpu freq = %d khz\n", prop.clockRate);

  cudaMallocManaged(&runtime, sizeof(int));
  *runtime = 0;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  add<<<1, 1>>>(runtime);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaStreamSynchronize(0);

  float milliseconds = 0.f;
  cudaEventElapsedTime(&milliseconds, start, stop);
  float seconds = milliseconds / 1000.f;

  printf("kernel runtime: %.7f seconds\n", seconds);
  cudaDeviceSynchronize();
  printf("kernel cycle: %d\n", *runtime);

  cudaFree(runtime);

  return 0;
}

Solution

I wouldn't use managed memory for this kind of work, if I could avoid it. It introduces a lot of complexity (unless you like that sort of thing).

To understand managed memory performance, it's important to know which GPU you are running on, which CUDA version, and also which OS (CentOS).
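As a quick way to check those details on a given machine, a minimal sketch along these lines should work (standard CUDA runtime API calls; the concurrentManagedAccess property indicates whether the demand-paged unified-memory regime discussed below applies):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int runtimeVer = 0, driverVer = 0;
  cudaRuntimeGetVersion(&runtimeVer);    // e.g. 10010 for CUDA 10.1
  cudaDriverGetVersion(&driverVer);      // CUDA version supported by the installed driver
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("GPU: %s, runtime %d, driver %d\n", prop.name, runtimeVer, driverVer);
  // nonzero => demand-paged (page-faulting) unified memory regime,
  // typical on Linux with Pascal or newer GPUs
  printf("concurrentManagedAccess = %d\n", prop.concurrentManagedAccess);
  return 0;
}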

I'm running on a Tesla V100 on CentOS with CUDA 10.1.243, and I see large run-to-run variability (on the order of 3x to 10x). I attribute this to the demand paging of memory that is going on.

Let's take a look at my SASS code:

$ nvcc -arch=sm_70 -o t1627 t1627.cu
$ cuobjdump -sass ./t1627

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_70

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_70
                Function : _Z3addPi
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;   /* 0x00000a00ff017624 */
                                                                             /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;             /* 0x000000fffffff389 */
                                                                             /* 0x000fe200000e00ff */
        /*0020*/                   IADD3 R1, R1, -0x8, RZ ;                  /* 0xfffffff801017810 */
                                                                             /* 0x000fc80007ffe0ff */
        /*0030*/                   IADD3 R6, P0, R1, c[0x0][0x20], RZ ;      /* 0x0000080001067a10 */
                                                                             /* 0x000fca0007f1e0ff */
        /*0040*/                   IMAD.X R7, RZ, RZ, c[0x0][0x24], P0 ;     /* 0x00000900ff077624 */
                                                                             /* 0x000fd000000e06ff */
        /*0050*/                   CS2R.32 R2, SR_CLOCKLO ;                  //begin in-kernel timing
                                                                             /* 0x000fd00000005000 */
        /*0060*/                   S2R R9, SR_TID.X ;                        /* 0x0000000000097919 */
                                                                             /* 0x000e220000002100 */
        /*0070*/                   MOV R4, 0x0 ;                             /* 0x0000000000047802 */
                                                                             /* 0x000fe40000000f00 */
        /*0080*/                   MOV R5, 0x0 ;                             /* 0x0000000000057802 */
                                                                             /* 0x000fe20000000f00 */
        /*0090*/                   S2R R8, SR_CTAID.X ;                      /* 0x0000000000087919 */
                                                                             /* 0x000e280000002500 */
        /*00a0*/                   STL.64 [R1], R8 ;                         /* 0x0000000801007387 */
                                                                             /* 0x0011e60000100a00 */
        /*00b0*/                   MOV R20, 0x0 ;                            /* 0x0000000000147802 */
                                                                             /* 0x000fe40000000f00 */
        /*00c0*/                   MOV R21, 0x0 ;                            /* 0x0000000000157802 */
                                                                             /* 0x000fd00000000f00 */
        /*00d0*/                   CALL.ABS.NOINC 0x0 ;                      //printf call
                                                                             /* 0x001fea0003c00000 */
        /*00e0*/                   CS2R.32 R5, SR_CLOCKLO ;                  //end in-kernel timing
                                                                             /* 0x000fd00000005000 */
        /*00f0*/                   IMAD.IADD R5, R5, 0x1, -R2 ;              /* 0x0000000105057824 */
                                                                             /* 0x000fe400078e0a02 */
        /*0100*/                   IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x160] ;  // set up managed address
                                                                             /* 0x000fc400078e00ff */
        /*0110*/                   IMAD.MOV.U32 R3, RZ, RZ, c[0x0][0x164] ;  /* 0x00005900ff037624 */
                                                                             /* 0x000fd000078e00ff */
        /*0120*/                   STG.E.SYS [R2], R5 ;                      // first (only) touch on managed allocation
                                                                             /* 0x000fe2000010e900 */
        /*0130*/                   EXIT ;                                    /* 0x000000000000794d */
                                                                             /* 0x000fea0003800000 */
        /*0140*/                   BRA 0x140;                                /* 0xfffffff000007947 */
                                                                             /* 0x000fc0000383ffff */
        /*0150*/                   NOP;                                      /* 0x0000000000007918 */
                                                                             /* 0x000fc00000000000 */
        /*0160*/                   NOP;                                      /* 0x0000000000007918 */
                                                                             /* 0x000fc00000000000 */
        /*0170*/                   NOP;                                      /* 0x0000000000007918 */
                                                                             /* 0x000fc00000000000 */
                ...................



Fatbin ptx code:
================
arch = sm_70
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

I've added some comments above. The in-kernel timing region (where you read clock() in the source code) is delineated at lines 0050 and 00e0. After line 00e0 (so, after the in-kernel timing has finished), you touch the managed allocation runtime on line 0120, to store the result.

In my case, I have a Tesla V100 with CUDA 10.1.243 on CentOS 7. This is a demand-paged regime for unified memory. In that case, the first touch to a managed allocation will trigger a page fault. The page fault is serviced by a complex interaction between the host operating system and the CUDA runtime (effectively the device operating system). This page fault servicing takes place outside of your in-kernel timing measurement, but it is measured by kernel-level timing (i.e. it impacts kernel duration), such as CUDA event-based timing or profilers.
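If you do want to stay with managed memory, one option (not part of the answer below) is to migrate the allocation to the GPU before the timed launch, so the fault is serviced outside the measured region. A minimal sketch of that idea, slotted into the original main() and assuming a GPU that supports concurrentManagedAccess:

// Hedged sketch: prefetch the managed allocation so the first-touch page fault
// is serviced before the timed kernel launch rather than inside it.
int dev = 0;
cudaGetDevice(&dev);
cudaMemPrefetchAsync(runtime, sizeof(int), dev, 0);   // migrate pages to the current GPU on stream 0
cudaDeviceSynchronize();                              // make sure the migration has completed

cudaEventRecord(start);
add<<<1, 1>>>(runtime);
cudaEventRecord(stop);
cudaEventSynchronize(stop);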

If I modify your code to use an ordinary device allocation, the large runtime variability goes away. If I make some additional changes for what I consider to be good benchmarking practice (such as performing a warm-up run), I find that the numbers correspond to each other somewhat better:

$ cat t1627.cu
#include <iostream>
#include <stdio.h>
#include <math.h>
__global__ void add(int *runtime)
{
  clock_t start_time = clock();
  printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
  clock_t end_time = clock();
  *runtime = (int)(end_time - start_time);
}

int main(void)
{
  int *runtime;

  cudaDeviceProp prop;
  int result = cudaGetDeviceProperties(&prop, 0);
  printf("gpu freq = %d khz\n", prop.clockRate);

  cudaMalloc(&runtime, sizeof(int));
  cudaMemset(runtime, 0, sizeof(int));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  add<<<1, 1>>>(runtime);
  cudaDeviceSynchronize();
  cudaEventRecord(start);
  add<<<1, 1>>>(runtime);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float milliseconds = 0.f;
  cudaEventElapsedTime(&milliseconds, start, stop);
  float seconds = milliseconds / 1000.f;

  printf("kernel runtime: %f s \n", seconds);
  int h_runtime;
  cudaMemcpy(&h_runtime, runtime, sizeof(int), cudaMemcpyDeviceToHost);
  printf("kernel cycle: %d\n", h_runtime);

  cudaFree(runtime);

  return 0;
}
$ nvcc -arch=sm_70 -o t1627 t1627.cu
$ ./t1627
gpu freq = 1380000 khz
Hello from block 0, thread 0
Hello from block 0, thread 0
kernel runtime: 0.000059 s
kernel cycle: 57376
$ nvprof ./t1627
==28252== NVPROF is profiling process 28252, command: ./t1627
gpu freq = 1380000 khz
Hello from block 0, thread 0
Hello from block 0, thread 0
kernel runtime: 0.000069 s
kernel cycle: 58997
==28252== Profiling application: ./t1627
==28252== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.49%  109.00us         2  54.497us  49.569us  59.426us  add(int*)
                    1.93%  2.1760us         1  2.1760us  2.1760us  2.1760us  [CUDA memcpy DtoH]
                    1.59%  1.7920us         1  1.7920us  1.7920us  1.7920us  [CUDA memset]
      API calls:   96.20%  329.20ms         1  329.20ms  329.20ms  329.20ms  cudaMalloc
                    1.58%  5.4205ms         4  1.3551ms  695.98us  3.3263ms  cuDeviceTotalMem
                    1.56%  5.3336ms       388  13.746us     357ns  614.73us  cuDeviceGetAttribute
                    0.35%  1.1925ms         1  1.1925ms  1.1925ms  1.1925ms  cudaGetDeviceProperties
                    0.13%  435.16us         4  108.79us  103.50us  114.98us  cuDeviceGetName
                    0.07%  235.87us         1  235.87us  235.87us  235.87us  cudaFree
                    0.03%  114.74us         2  57.371us  17.808us  96.935us  cudaLaunchKernel
                    0.03%  88.291us         1  88.291us  88.291us  88.291us  cudaDeviceSynchronize
                    0.02%  59.720us         1  59.720us  59.720us  59.720us  cudaEventSynchronize
                    0.01%  35.692us         1  35.692us  35.692us  35.692us  cudaMemcpy
                    0.01%  26.655us         4  6.6630us  3.8710us  11.334us  cuDeviceGetPCIBusId
                    0.01%  26.631us         1  26.631us  26.631us  26.631us  cudaMemset
                    0.00%  16.933us         2  8.4660us  5.9710us  10.962us  cudaEventRecord
                    0.00%  8.8200us         8  1.1020us     449ns  1.8970us  cuDeviceGet
                    0.00%  8.5660us         2  4.2830us  1.0320us  7.5340us  cudaEventCreate
                    0.00%  4.0930us         3  1.3640us     390ns  2.3880us  cuDeviceGetCount
                    0.00%  3.6490us         1  3.6490us  3.6490us  3.6490us  cudaEventElapsedTime
                    0.00%  2.9010us         4     725ns     547ns     900ns  cuDeviceGetUuid
$

in kernel: 57376/1380000000 = 41.5us

event: 69us

nvprof: 49.57us

Note that the in-kernel measurement calculated above assumes the GPU is running at essentially its maximum clock rate. That isn't always the case, and may not typically be the case. Therefore the implied time could be higher than what is calculated above (if the effective clock rate is lower than the maximum). All recent GPUs have variable clocking schemes.
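If you want the cycles-to-seconds conversion to use the clock the GPU is actually running at, one option is to query the current SM clock through NVML rather than relying on prop.clockRate (which reflects the maximum). A rough sketch, assuming the NVML header and library are available (link with -lnvidia-ml):

#include <cstdio>
#include <nvml.h>

int main()
{
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(0, &dev);
  unsigned int smClockMHz = 0;
  nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClockMHz);   // current SM clock, in MHz
  printf("current SM clock = %u MHz\n", smClockMHz);
  // convert measured cycles with the effective rate, e.g.:
  //   seconds = cycles / (smClockMHz * 1.0e6)
  nvmlShutdown();
  return 0;
}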
