Persistent threads in OpenCL and CUDA


Question

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can anyone give me an example, or show me how this programming style is used?

What I have retained after reading and googling "persistent threads":

Persistent threads: it is nothing more than a while loop that keeps threads running and processing many units of work.

Is this correct? Thanks in advance.

References:
http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf

Answer

CUDA exploits a Single Instruction Multiple Data (SIMD) programming model (NVIDIA calls its variant SIMT: Single Instruction, Multiple Threads). The computational threads are organized in blocks, and the thread blocks are assigned to Streaming Multiprocessors (SMs). A thread block executes on an SM with its threads arranged in warps of 32: each warp operates in lock-step and executes exactly the same instruction on different data.

Generally, to fill up the GPU, a kernel is launched with many more blocks than can actually be hosted on the SMs at once. Since not all the blocks can be resident on the SMs simultaneously, a work scheduler swaps in a new block when a block has finished computing. Note that block scheduling is managed entirely in hardware by the scheduler; the programmer has no means of influencing how blocks are scheduled onto the SMs. This is a limitation for all those algorithms that do not perfectly fit the SIMD programming model and that exhibit work imbalance: a block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.

Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes one work item, it checks the queue for more work, and keeps doing so until no work is left, at which point the block retires. In this way, the kernel is launched with only as many blocks as there are available SMs.

The persistent threads technique is better illustrated by the following example, taken from the presentation "GPGPU" computing and the CUDA/OpenCL Programming Model.

Another, more detailed example is available in the paper Understanding the Efficiency of Ray Traversal on GPUs.

// Persistent thread: run until the work is done, processing multiple work
// items per thread rather than just one; terminate when no work remains.
// count is the number of data items to be processed.
// read_and_increment() is pseudocode for an atomic fetch-and-add
// (e.g. atomicAdd) on the shared queue head.

__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;
    while ((local_input_data_index = read_and_increment(ahead)) < count)
    {
        load_locally(a[local_input_data_index]);

        do_work_with_locally_loaded_data();

        local_output_data_index = read_and_increment(bhead);

        write_result(b[local_output_data_index]);
    }
}

// Launch exactly enough threads to fill up the machine (to achieve
// sufficient parallelism and latency hiding).
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
