Timing Kernel launches in CUDA while using Thrust

Question

Kernel launches in CUDA are generally asynchronous, which (as I understand it) means that control returns to the CPU immediately after the kernel is launched. The CPU can then continue doing useful work while the GPU is busy number crunching, unless the CPU is forcibly stalled with cudaThreadSynchronize() or cudaMemcpy().

Now I have just started using the Thrust library for CUDA. Are the function calls in Thrust synchronous or asynchronous?

In other words, if I invoke thrust::sort(D.begin(),D.end()); where D is a device vector, does it make sense to measure the sorting time using

    start = clock(); // Start

    thrust::sort(D.begin(), D.end());

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;

If the function call is asynchronous then diff will be 0 seconds for any vector (which is junk for timings), but if it is synchronous I will indeed get the real time performance.

Recommended answer

Thrust calls which invoke kernels are asynchronous, just like the underlying CUDA APIs Thrust uses. Thrust calls which copy data are synchronous, just like the underlying CUDA APIs Thrust uses.
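A minimal sketch of that distinction, assuming a CUDA toolkit with Thrust available. The helper name sort_on_device is hypothetical and not part of Thrust. The sort call may return before the GPU has finished, while the device-to-host copy only returns once the sorted data is on the host. Exact synchronization behavior can vary between Thrust versions, so treat the comments as describing the model above rather than a guarantee.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <algorithm>

// Hypothetical helper: sorts a copy of `input` on the GPU and returns it on the host.
thrust::host_vector<int> sort_on_device(const thrust::host_vector<int>& input)
{
    // Host-to-device copy.
    thrust::device_vector<int> d = input;

    // Launches kernels; control may come back before they have finished.
    thrust::sort(d.begin(), d.end());

    // Device-to-host copy; returns only once the data has arrived on the host,
    // so the sort is necessarily complete by this point.
    thrust::host_vector<int> result(d.size());
    thrust::copy(d.begin(), d.end(), result.begin());
    return result;
}
```

Because the copy acts as a synchronization point, a stop time recorded after it covers the sort and the transfer together.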

So your example would only be measuring the kernel launch and Thrust host-side setup overheads, not the operation itself. For timing, you can get around this by calling either cudaThreadSynchronize or cudaDeviceSynchronize (the latter in CUDA 4.0 or later) after the Thrust kernel launch. Alternatively, if you include a copy operation after the kernel launch and record the stop time after that, your timing will include setup, execution, and copying time.

In your example, this would look something like:

    start = clock(); // Start

    thrust::sort(D.begin(), D.end());
    cudaThreadSynchronize(); // block until the kernel has finished (cudaDeviceSynchronize() in CUDA 4.0 or later)

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;
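The same pattern can be wrapped in a small function. The sketch below is one possible variant, not the answer's own code: it assumes CUDA 4.0 or later (so cudaDeviceSynchronize is available) and a C++11 compiler, and it swaps clock() for std::chrono::steady_clock, since clock() reports processor time on many platforms and may not reflect time spent waiting on the GPU. The name time_device_sort is hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <chrono>

// Hypothetical helper: returns the wall-clock time, in seconds, taken to sort `d` on the GPU.
double time_device_sort(thrust::device_vector<int>& d)
{
    // Drain any earlier GPU work so it is not counted in the measurement.
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();

    thrust::sort(d.begin(), d.end());

    // Block until the sort has actually finished on the device.
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_device_sort(D) on the device vector from the question would then give a figure that includes host-side setup, kernel launch, and execution, but no device-to-host copy.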
