How to measure CUDA times correctly?


Question

I'm trying to measure the times of the parallel and the sequential executions correctly, but I have doubts.

Suppose we have the following code:

    //get the time
    clock_t start,finish;
    double totaltime;
    start = clock(); 

    double *d_A, *d_B, *d_X;

    cudaMalloc((void**)&d_A, sizeof(double) * Width * Width);
    cudaMalloc((void**)&d_B, sizeof(double) * Width);
    cudaMalloc((void**)&d_X, sizeof(double) * Width);

    cudaMemcpy(d_A, A, sizeof(double) * Width * Width, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, sizeof(double) * Width, cudaMemcpyHostToDevice);  


    do_parallel_matmul<<<dimB, dimT>>>(d_A, d_B, d_X, Width);   


    cudaMemcpy(X, d_X, sizeof(double) * Width, cudaMemcpyDeviceToHost);

    finish = clock();

    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;   

    printf("%f", totaltime);

This time is much longer than the sequential time, measured as follows:

clock_t start,finish;
double totaltime;
start = clock(); 

do_seq_matmult();

finish = clock();

totaltime=(double)(finish-start)/CLOCKS_PER_SEC;   

printf("%f", totaltime);

So I don't know if I should only measure the CUDA kernel time as follows:

clock_t start,finish;
double totaltime;
start = clock(); 

do_parallel_matmul();

finish = clock();

totaltime=(double)(finish-start)/CLOCKS_PER_SEC;   

printf("%f", totaltime);

and avoid memory copies between host and device...

I'm asking the above because I have to submit a comparison between parallel and sequential executions... But if I include the memory copies in the CUDA timing, there isn't much of a difference between CUDA and C...
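As a side note on the kernel-only measurement above: a kernel launch returns to the host immediately, so the host clock must not be stopped until the device has finished, or only the launch overhead gets measured. A minimal sketch of what that would look like (dimB, dimT, the device pointers and Width are assumed to be set up as in the first snippet):

```cuda
    // Time only the kernel with a host-side clock.
    // The launch is asynchronous, so wait for the device
    // to finish before reading the clock again.
    clock_t start, finish;
    start = clock();

    do_parallel_matmul<<<dimB, dimT>>>(d_A, d_B, d_X, Width);
    cudaDeviceSynchronize(); // block until the kernel has completed

    finish = clock();
    double totaltime = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%f\n", totaltime);
```

Note that clock() measures host-side wall time around the synchronization; for device-side timing, CUDA events (shown in the answer below the list) are the more precise tool.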

EDIT:

void do_seq_matmult(const double *A, const double *X, double *resul, const int tam)
{
    *resul = 0;
    for(int i = 0; i < tam; i++)
    {
        for(int  j = 0; j < tam; j++)
        {
            if(i != j)
                *resul += A[i * tam + j] * X[j];
        }
    }
}

__global__ void do_parallel_matmul( double * mat_A, 
                            double * vec, 
                            double * rst, 
                            int dim)
{
     int rowIdx = threadIdx.x + blockIdx.x * blockDim.x; // Get the row Index 
     int aIdx;
     while( rowIdx < dim)
     {
          rst[rowIdx] = 0; // clean the value at first
          for (int i = 0; i < dim; i++)
          {
               aIdx = rowIdx * dim + i; // Get the index for the element a_{rowIdx, i}
               rst[rowIdx] += (mat_A[aIdx] * vec[i] ); // do the multiplication
          }
          rowIdx += gridDim.x * blockDim.x;
     }
     __syncthreads();
}

Solution

Some thoughts:

  1. It is not fair to time the allocation of device memory and compare it against the CPU version without also timing the host memory allocation.

  2. If cudaMalloc((void**)&d_A, sizeof(double) * Width * Width); is the first CUDA call, it will include the CUDA context creation, which can be a significant overhead.

  3. Timing cudaMemcpy is not a fair CPU/GPU comparison because this time depends on the PCI-e bandwidth of the system. On the other hand, if you view the kernel as an accelerator from the CPU's point of view, you will need to include the memcpy. To reach peak PCI-e bandwidth, use page-locked (pinned) memory.

  4. If your application is going to run the multiplication several times, then you have the ability to hide most of the memcpy time by overlapping copies with kernel execution. This works even better on a Tesla unit, which has dual DMA engines.

  5. Timing the kernel itself requires you to synchronize the CPU with the GPU before stopping the timer; otherwise you will only time the kernel launch, not its execution, because launching a kernel from the CPU is asynchronous. If you want to time kernel execution on the GPU, use cudaEvents.

  6. Run many threads on the GPU to get a fair comparison.

  7. Improve the kernel; you can do better.
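To illustrate point 5, here is a minimal sketch of timing a kernel with CUDA events. The kernel name and launch configuration are taken from the question; error checking is omitted for brevity:

```cuda
    cudaEvent_t ev_start, ev_stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&ev_start);
    cudaEventCreate(&ev_stop);

    cudaEventRecord(ev_start, 0); // record on the default stream
    do_parallel_matmul<<<dimB, dimT>>>(d_A, d_B, d_X, Width);
    cudaEventRecord(ev_stop, 0);

    cudaEventSynchronize(ev_stop); // wait until the stop event has completed
    cudaEventElapsedTime(&elapsed_ms, ev_start, ev_stop); // milliseconds

    printf("kernel time: %f ms\n", elapsed_ms);

    cudaEventDestroy(ev_start);
    cudaEventDestroy(ev_stop);
```

Because the events are recorded on the GPU's own timeline, this measures actual device execution time rather than launch overhead, and it does not depend on the resolution of the host clock.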
