cuda kernels not executing concurrently


Problem description

I'm trying to explore the concurrent kernel execution property of my Nvidia Quadro 4000, which has compute capability 2.0.

I use two different streams, each of which runs the same sequence:

  1. Copy H2D two different chunks of pinned memory
  2. Run kernel
  3. Copy D2H two different chunks back to pinned memory.

The kernels of both streams are exactly the same and each has a 190 ms execution time.

In the Visual Profiler (version 5.0) I expected both kernels to start execution simultaneously; however, they overlap by only 20 ms. Here is the code sample:


//initiate the streams
        cudaStream_t stream0,stream1;
        CHK_ERR(cudaStreamCreate(&stream0));
        CHK_ERR(cudaStreamCreate(&stream1));
        //allocate the memory on the GPU for stream0
        CHK_ERR(cudaMalloc((void **)&def_img0, width*height*sizeof(char)));
        CHK_ERR(cudaMalloc((void **)&ref_img0, width*height*sizeof(char)));
        CHK_ERR(cudaMalloc((void **)&outY_img0,width_size_for_out*height_size_for_out*sizeof(char)));
        CHK_ERR(cudaMalloc((void **)&outX_img0,width_size_for_out*height_size_for_out*sizeof(char)));
        //allocate the memory on the GPU for stream1
        CHK_ERR(cudaMalloc((void **)&def_img1, width*height*sizeof(char)));
        CHK_ERR(cudaMalloc((void **)&ref_img1, width*height*sizeof(char)));
        CHK_ERR(cudaMalloc((void **)&outY_img1,width_size_for_out*height_size_for_out*sizeof(char)));
        CHK_ERR(cudaMalloc((void **)&outX_img1,width_size_for_out*height_size_for_out*sizeof(char)));

        //allocate page-locked memory for stream0
        CHK_ERR(cudaHostAlloc((void**)&host01, width*height*sizeof(char), cudaHostAllocDefault));
        CHK_ERR(cudaHostAlloc((void**)&host02, width*height*sizeof(char), cudaHostAllocDefault));
        CHK_ERR(cudaHostAlloc((void**)&host03, width_size_for_out*height_size_for_out*sizeof(char), cudaHostAllocDefault));
        CHK_ERR(cudaHostAlloc((void**)&host04, width_size_for_out*height_size_for_out*sizeof(char), cudaHostAllocDefault));

        //allocate page-locked memory for stream1
        CHK_ERR(cudaHostAlloc((void**)&host11, width*height*sizeof(char), cudaHostAllocDefault));
        CHK_ERR(cudaHostAlloc((void**)&host12, width*height*sizeof(char), cudaHostAllocDefault));
        CHK_ERR(cudaHostAlloc((void**)&host13, width_size_for_out*height_size_for_out*sizeof(char), cudaHostAllocDefault));
        CHK_ERR(cudaHostAlloc((void**)&host14, width_size_for_out*height_size_for_out*sizeof(char), cudaHostAllocDefault));


        memcpy(host01,in1,width*height*sizeof(char));
        memcpy(host02,in2,width*height*sizeof(char));

        memcpy(host11,in1,width*height*sizeof(char));
        memcpy(host12,in2,width*height*sizeof(char));



        cudaEvent_t start, stop;
        float time;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dim3 dimBlock(CUDA_BLOCK_DIM, CUDA_BLOCK_DIM);
        dim3 Grid((width-SEARCH_RADIUS*2-1)/(dimBlock.x*4)+1, (height-SEARCH_RADIUS*2-1)/(dimBlock.y*4)+1);

        cudaEventRecord(start,0);
        // --------------------
        // Copy images to device
        // --------------------
        //enqueue copies of def stream0 and stream1
        CHK_ERR(cudaMemcpyAsync(def_img0, host01,width*height*sizeof(char), cudaMemcpyHostToDevice,stream0));
        CHK_ERR(cudaMemcpyAsync(def_img1, host11,width*height*sizeof(char), cudaMemcpyHostToDevice,stream1));
        //enqueue copies of ref stream0 and stream1
        CHK_ERR(cudaMemcpyAsync(ref_img0, host02,width*height*sizeof(char), cudaMemcpyHostToDevice,stream0));
        CHK_ERR(cudaMemcpyAsync(ref_img1, host12,width*height*sizeof(char), cudaMemcpyHostToDevice,stream1));

        CHK_ERR(cudaStreamSynchronize(stream0));
        CHK_ERR(cudaStreamSynchronize(stream1));

        //CALLING KERNEL
        //enqueue kernel in stream0 and stream1
        TIME_KERNEL((exhaustiveSearchKernel<CUDA_BLOCK_DIM*4,CUDA_BLOCK_DIM*4,SEARCH_RADIUS><<< Grid, dimBlock,0,stream0>>>(def_img0+SEARCH_RADIUS*width+SEARCH_RADIUS,ref_img0,outX_img0,outY_img0,width,width_size_for_out)),"exhaustiveSearchKernel stream0");
        TIME_KERNEL((exhaustiveSearchKernel<CUDA_BLOCK_DIM*4,CUDA_BLOCK_DIM*4,SEARCH_RADIUS><<< Grid, dimBlock,0,stream1>>>(def_img1+SEARCH_RADIUS*width+SEARCH_RADIUS,ref_img1,outX_img1,outY_img1,width,width_size_for_out)),"exhaustiveSearchKernel stream1");


        //Copy result back
        CHK_ERR(cudaMemcpyAsync(host03, outX_img0, width_size_for_out*height_size_for_out*sizeof(char), cudaMemcpyDeviceToHost,stream0));
        CHK_ERR(cudaMemcpyAsync(host13, outX_img1, width_size_for_out*height_size_for_out*sizeof(char), cudaMemcpyDeviceToHost,stream1));

        CHK_ERR(cudaMemcpyAsync(host04, outY_img0, width_size_for_out*height_size_for_out*sizeof(char), cudaMemcpyDeviceToHost,stream0));
        CHK_ERR(cudaMemcpyAsync(host14, outY_img1, width_size_for_out*height_size_for_out*sizeof(char), cudaMemcpyDeviceToHost,stream1));


        CHK_ERR(cudaStreamSynchronize(stream0));
        CHK_ERR(cudaStreamSynchronize(stream1));

        cudaEventRecord( stop, 0 );
        cudaEventSynchronize( stop );
        cudaEventElapsedTime( &time, start, stop );
        printf("Elapsed time=%f ms\n",time);

        memcpy(outX,host03,width_size_for_out*height_size_for_out*sizeof(char));
        memcpy(outY,host04,width_size_for_out*height_size_for_out*sizeof(char));


        cudaEventDestroy( start ); 
        cudaEventDestroy( stop );
        CHK_ERR(cudaStreamDestroy(stream0));
        CHK_ERR(cudaStreamDestroy(stream1));

        CHK_ERR(cudaDeviceReset());


    } 

Solution

Compute Capability 2.x-3.0

Compute capability 2.x-3.0 devices have a single hardware work queue. The CUDA driver pushes commands into the work queue. The GPU host reads the commands and dispatches the work to the copy engines or the CUDA Work Distributor (CWD). The CUDA driver inserts synchronization commands into the hardware work queue to guarantee that work on the same stream is not able to run concurrently. When the host hits a synchronization command it will stall until the dependent work is completed.
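As a quick check (not part of the original answer), you can ask the runtime whether a device reports concurrent-kernel support and how many copy engines it has; the sketch below uses the standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // concurrentKernels: whether kernels from different streams can run
        // at the same time; asyncEngineCount: how many copy engines are
        // available for overlapping transfers with kernel execution.
        printf("Device: %s (CC %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("concurrentKernels = %d\n", prop.concurrentKernels);
        printf("asyncEngineCount  = %d\n", prop.asyncEngineCount);
        return 0;
    }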

Concurrent kernel execution improves GPU utilization when a grid is too small to fill the entire GPU or when grids have a tail effect (a subset of the thread blocks executes much longer than the others).

Case 1: Back to back kernels on one stream

If an application launches two kernels back to back on the same stream, the synchronization command inserted by the CUDA driver will not dispatch the second kernel to CWD until the first kernel has completed.
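A minimal sketch of Case 1 (busyKernel and the buffer names are illustrative, not the question's exhaustiveSearchKernel):

    // Illustrative kernel that just burns time on each element.
    __global__ void busyKernel(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            for (int k = 0; k < 10000; ++k)
                buf[i] = buf[i] * 1.0001f + 0.5f;
    }

    void case1SameStream(float *d_a, float *d_b, int n, cudaStream_t s)
    {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        // Both launches go to the same stream, so the second kernel is not
        // dispatched to CWD until the first has completed, no matter how
        // small the grids are.
        busyKernel<<<grid, block, 0, s>>>(d_a, n);
        busyKernel<<<grid, block, 0, s>>>(d_b, n);
    }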

Case 2: Back to back kernel launches on two streams

If an application launches two kernels on different streams, the host will read the commands and dispatch them to CWD. CWD will rasterize the first grid (the order is architecture dependent) and dispatch thread blocks to the SMs. Only when all of the thread blocks from the first grid have been dispatched will CWD dispatch thread blocks from the second grid.
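Reusing the hypothetical busyKernel from the sketch above, Case 2 would look like this; whether the two grids actually overlap depends on the first grid leaving SMs idle:

    void case2TwoStreams(float *d_a, float *d_b, int n,
                         cudaStream_t s0, cudaStream_t s1)
    {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        // Both launches reach CWD, but thread blocks of the second grid are
        // dispatched only after every block of the first grid has been
        // dispatched, so visible overlap requires the first grid to be small
        // (or tail-heavy) enough to leave SMs free.
        busyKernel<<<grid, block, 0, s0>>>(d_a, n);
        busyKernel<<<grid, block, 0, s1>>>(d_b, n);
    }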

Compute Capability 3.5

Compute capability 3.5 introduced several new features to improve GPU utilization. These include:

  - HyperQ supports multiple independent hardware work queues.
  - Dynamic Parallelism allows device code to launch new work.
  - CWD capacity was increased to 32 grids.
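For illustration only (not from the original answer): with HyperQ, streams map onto separate hardware work queues, so the single-queue false serialization described above largely disappears; dynamic parallelism additionally lets a kernel launch further kernels itself. A rough sketch of a device-side launch, assuming it is compiled with -arch=sm_35 -rdc=true and using hypothetical kernel names:

    __global__ void childKernel(float *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;
    }

    __global__ void parentKernel(float *buf, int n)
    {
        // A single thread launches new work directly from device code.
        if (blockIdx.x == 0 && threadIdx.x == 0)
            childKernel<<<(n + 255) / 256, 256>>>(buf, n);
    }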

