Persistent threads in OpenCL and CUDA


Question

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can anyone give me an example or show me the use of this programming style?

What I keep in mind after reading and googling "persistent threads":


Persistent threads: it's no more than a while loop that keeps a thread running and computing many chunks of work.

Is this correct? Thanks in advance.

References:
http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf

Answer

CUDA exploits the Single Instruction Multiple Data (SIMD) programming model. The computational threads are organized in blocks, and the thread blocks are assigned to different Streaming Multiprocessors (SMs). Execution of a thread block on an SM is performed by arranging its threads in warps of 32 threads: each warp operates in lock-step and executes exactly the same instruction on different data.

Generally, to fill up the GPU, the kernel is launched with many more blocks than can actually be hosted on the SMs. Since not all the blocks can be resident on an SM at once, a work scheduler performs a context switch when a block has finished computing. Note that the switching of blocks is managed entirely in hardware by the scheduler; the programmer has no means of influencing how blocks are scheduled onto the SMs. This exposes a limitation for all those algorithms that do not perfectly fit the SIMD programming model and for which there is work imbalance: block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.

Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes a work item, it checks the queue for more work and continues doing so until none is left, at which point the block retires. In this way, the kernel is launched with only as many blocks as the number of available SMs.

The following example is taken from the presentation "GPGPU computing and the CUDA/OpenCL Programming Model". Another, more detailed example is available in the paper "Understanding the efficiency of ray traversal on GPUs".

// Persistent thread: run until the work is done, processing multiple work
// items per thread rather than just one. Terminates when no more work is
// available. `count` is the number of data items to be processed;
// `read_and_increment` would typically be implemented with an atomicAdd
// on the global queue head.

__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;
    while ((local_input_data_index = read_and_increment(ahead)) < count)
    {
        load_locally(a[local_input_data_index]);

        do_work_with_locally_loaded_data();

        local_output_data_index = read_and_increment(bhead);

        write_result(b[local_output_data_index]);
    }
}

// Launch exactly enough threads to fill up the machine (to achieve sufficient
// parallelism and latency hiding)
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);

