Copy from GPU to CPU is slower than copying from CPU to GPU


Problem description

I have been learning CUDA for a while now, and I have the following problem.

See what I am doing below:

Copy to GPU

int* B;
// ...
int *dev_B;    
//initialize B=0

cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int),cudaMemcpyHostToDevice);
//...

//Execute on GPU the following function which is supposed to fill in 
//the dev_B matrix with integers


findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);

Copy back to CPU

cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int),cudaMemcpyDeviceToHost);
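
The post does not show the timing code itself; judging by the answer below, the intervals were presumably measured on the host with clock(). A hypothetical sketch of such a measurement around the device-to-host copy (the t0/t1 variables are illustrative, not from the original post):

// Hypothetical host-side timing (requires <cstdio> and <ctime>)
clock_t t0 = clock();
cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
clock_t t1 = clock();
printf("D2H copy: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
// Because the preceding kernel launch is asynchronous, this interval also
// includes the kernel's execution time, not just the copy itself.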

  1. Copying array B to dev_B takes only a fraction of a second. However, copying array dev_B back to B takes forever.
  2. The findNeiborElem function involves a per-thread loop; e.g., it looks like this:

    __global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC){

        // grid-stride loop: each thread handles elements tid, tid + blockDim.x*gridDim.x, ...
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < dev_Nel[0]){
            for (int j = 1; j <= dev_Nel[0]; j++){
                 // do some calculations
                 dev_B[ind(tid, 1, dev_Nel[0])] = j;  // in most cases j does not go all the way up to Nel
                 break;
            }
            tid += blockDim.x * gridDim.x;
        }
    }
    

What's very weird about it is that the time to copy dev_B to B is proportional to the number of iterations of the j index.

For example, if Nel=5 then the time is approximately 5 seconds.

When I increase Nel to 20, the time is about 20 seconds.

I would expect that the copy time should be independent of the inner iterations needed to assign the values of the matrix dev_B.

Also, I would expect that the time to copy the same matrix to and from the CPU would be of the same order.

Do you have any idea what is wrong?

Solution

Instead of using clock() to measure time, you should use CUDA events:

Using events you would have something like this:

  cudaEvent_t start, stop;   // variables that hold the two events
  float time;                // variable that will hold the elapsed time
  cudaEventCreate(&start);   // create event 1
  cudaEventCreate(&stop);    // create event 2
  cudaEventRecord(start, 0); // start measuring time

  // What you want to measure
  cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
  cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);

  cudaEventRecord(stop, 0);                 // stop the time measurement
  cudaEventSynchronize(stop);               // wait until all device work preceding the most
                                            // recent call to cudaEventRecord() has completed

  cudaEventElapsedTime(&time, start, stop); // save the measured time (in milliseconds)
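
The same pattern can bracket the kernel launch and the device-to-host copy separately, which shows where the time actually goes. A minimal sketch along those lines, reusing the events and names above (the kernelTime/copyTime variables are illustrative additions, not part of the original answer):

  float kernelTime = 0.0f, copyTime = 0.0f;

  // Time the kernel by itself
  cudaEventRecord(start, 0);
  findNeiborElem<<<Nblocks, Nthreads>>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);                     // wait for the kernel to finish
  cudaEventElapsedTime(&kernelTime, start, stop); // kernel time in milliseconds

  // Time only the device-to-host copy
  cudaEventRecord(start, 0);
  cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&copyTime, start, stop);   // copy time in milliseconds

Measured this way, copyTime should stay roughly constant while kernelTime grows with the work done inside the kernel.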

EDIT: Additional information:

"The kernel launch returns control to the CPU thread before it is finished. Therefore your timing construct is measuring both the kernel execution time as well as the 2nd memcpy. When timing the copy after the kernel, your timer code is being executed immediately, but the cudaMemcpy is waiting for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary based on kernel loop iterations. It also explains why the time spent on your kernel function is "negligible"". credits to Robert Crovella
