Kepler: Concurrent kernel launches not overlapping

Views: 68 | Published: 2020/10/7 21:45:31 | Tags: concurrency, cuda


Question

I am trying to overlap kernel execution on a Kepler device, but the NVVP timeline shows that the kernels are not overlapping. Here is the code:

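#include <stdio.h>
#include <stdlib.h>     // malloc, free
#include <sys/time.h>   // gettimeofday
#include <time.h>

#define NY 1024
#define NX 1024

// Copies column j of A into row j of b, one thread per element.
__global__ void kernel1(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    b[j*NY+i] = A[i*NY+j];
}

// Adds 10 to each element of row j of b, 100 times. A is not used.
__global__ void kernel2(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int time = 0; time < 100; time++)
        b[j*NY+i] += 10;
}

int main()
{
    int nstreams = 4;
    int *a, *b;
    struct timeval t1, t2;

    cudaMalloc((void**)&a, NX*NY*sizeof(int));
    cudaMalloc((void**)&b, NX*NY*sizeof(int));

    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamCreate(&(streams[i]));
    }

    gettimeofday(&t1, NULL);

    // One single-block launch per row, distributed round-robin over the streams.
    for (int newvar = 0; newvar < NX; newvar++)
    {
        kernel1<<<1,NY,0,streams[newvar%nstreams]>>>(newvar, a, b);
    }
    for (int newvar = 0; newvar < NX; newvar++)
    {
        kernel2<<<1,NY,0,streams[newvar%nstreams]>>>(newvar, a, b);
    }
    cudaDeviceSynchronize();
    gettimeofday(&t2, NULL);

    // Report the elapsed wall-clock time and release resources.
    printf("elapsed: %.3f ms\n",
           (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0);
    for (int i = 0; i < nstreams; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
    cudaFree(a);
    cudaFree(b);
    return 0;
}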

Please suggest some tips.
CUDA version: 5.5
NVVP version: 5.5
Machine: Linux, Ubuntu 12.10

Recommended answer

Fundamentally, I think the problem is that your kernels do not run long enough. Each kernel executes for only a few microseconds, and the kernel launch overhead is also a few microseconds, so you see no overlap: by the time the API has finished setting up the next kernel launch, the previous kernel has already completed.
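One way to check this on your own device is to time a batch of back-to-back launches with CUDA events. The following is a minimal sketch, not part of the original answer: it reuses kernel1 from the question, launches it NX times in the default stream, and prints the average time per launch. The helper average_us and the warm-up launch are my own assumptions for illustration. The result includes both launch overhead and execution time, so it is only an upper bound on the kernel duration; if it comes out at a few microseconds, the kernels are too short for overlap to be visible.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NY 1024
#define NX 1024

// Same body as kernel1 in the question.
__global__ void kernel1(int j, int *A, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    b[j * NY + i] = A[i * NY + j];
}

// Hypothetical helper: average time per launch in microseconds.
static float average_us(float total_ms, int reps)
{
    return total_ms * 1000.0f / reps;
}

int main()
{
    int *a, *b;
    cudaMalloc((void **)&a, NX * NY * sizeof(int));
    cudaMalloc((void **)&b, NX * NY * sizeof(int));
    cudaMemset(a, 0, NX * NY * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization cost is not timed.
    kernel1<<<1, NY>>>(0, a, b);
    cudaDeviceSynchronize();

    // Time NX back-to-back launches in the default stream.
    const int reps = NX;
    cudaEventRecord(start, 0);
    for (int j = 0; j < reps; j++)
        kernel1<<<1, NY>>>(j, a, b);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    printf("average per launch: %.2f us\n", average_us(total_ms, reps));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Compile with nvcc and run on the target GPU. The exact figure depends on the device and driver, so treat it as a rough indication only.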

I modified kernel1 as follows:

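__global__ void kernel1(int j, int *A, int *b)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Artificial loop whose only purpose is to make the kernel run longer.
  // Dividing by (j + 1) instead of j avoids an integer division by zero when j == 0.
  for (int q = 0; q < 1000; q++)
    b[j*NY+i] = A[i*NY+j] + q/(j + 1);
}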

There is nothing magical or special about these modifications. I was just looking for a way to increase the kernel execution time from a few microseconds to a few milliseconds.

With the above changes, I saw good overlap of your kernel1 launches in the profiler.

I imagine something similar could be done with your kernel2.
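As a rough sketch of what that might look like (my own illustration, not code from the original answer), the program below lengthens kernel2 and compares the wall-clock time of NX launches issued to one stream against the same launches spread over four streams. The names kernel2_long, ITERS, elapsed_ms and run are hypothetical. Note one assumption in particular: simply raising the loop count may not help, because the compiler is free to fold the repeated "+= 10" into a single addition, so the sketch accesses b through a volatile pointer to force a real load and store on every iteration. The unused A parameter is dropped.

```cuda
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>

#define NY 1024
#define NX 1024
#define ITERS 1000   // hypothetical value; raise it until the kernel is long enough

// kernel2 from the question with a longer loop. The volatile pointer is an
// assumption added here to keep the compiler from collapsing the loop.
__global__ void kernel2_long(int j, int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    volatile int *vb = b;
    for (int t = 0; t < ITERS; t++)
        vb[j * NY + i] += 10;
}

// Hypothetical helper: milliseconds between two gettimeofday samples.
static double elapsed_ms(struct timeval t1, struct timeval t2)
{
    return (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0;
}

// Issues NX single-block launches round-robin over nstreams streams and
// returns the wall-clock time until all of them have finished.
static double run(int nstreams, int *b)
{
    cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
    for (int s = 0; s < nstreams; s++)
        cudaStreamCreate(&streams[s]);

    struct timeval t1, t2;
    cudaDeviceSynchronize();
    gettimeofday(&t1, NULL);
    for (int j = 0; j < NX; j++)
        kernel2_long<<<1, NY, 0, streams[j % nstreams]>>>(j, b);
    cudaDeviceSynchronize();
    gettimeofday(&t2, NULL);

    for (int s = 0; s < nstreams; s++)
        cudaStreamDestroy(streams[s]);
    free(streams);
    return elapsed_ms(t1, t2);
}

int main()
{
    int *b;
    cudaMalloc((void **)&b, NX * NY * sizeof(int));
    cudaMemset(b, 0, NX * NY * sizeof(int));

    run(1, b);  // warm-up pass, result discarded
    printf("1 stream : %.3f ms\n", run(1, b));
    printf("4 streams: %.3f ms\n", run(4, b));

    cudaFree(b);
    return 0;
}
```

If the four-stream run is noticeably faster than the one-stream run, the kernels are probably overlapping. The profiler timeline remains the more direct way to confirm it.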



