Streaming multiprocessors, Blocks and Threads (CUDA)


Question

What is the relationship between a CUDA core, a streaming multiprocessor, and the CUDA model of blocks and threads?

What gets mapped to what, what is parallelized, and how? And which is more efficient: maximizing the number of blocks or the number of threads?

My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core can execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.

Is this correct?

Recommended answer

The thread/block layout is described in detail in the CUDA Programming Guide. In particular, chapter 4 states:

The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
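To make the grid/block enumeration in that passage concrete, here is a minimal host-side sketch in plain C++ (so it runs without a GPU) of the index arithmetic for a one-dimensional launch. The names ThreadCoord, blocks_for and global_index are hypothetical helpers invented for this illustration; they mirror CUDA's built-in blockIdx.x, blockDim.x and threadIdx.x but are not part of any CUDA API. The sketch says nothing about which SM a block lands on, since that is decided by the hardware.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side stand-in for the coordinates a CUDA thread sees
// in a 1-D launch: blockIdx.x and threadIdx.x.
struct ThreadCoord {
    std::size_t block_idx;   // position of the block within the grid
    std::size_t thread_idx;  // position of the thread within its block
};

// Number of blocks needed so that blocks * threads_per_block covers
// n_elements (integer division rounded up).
std::size_t blocks_for(std::size_t n_elements, std::size_t threads_per_block) {
    assert(threads_per_block != 0);
    return (n_elements + threads_per_block - 1) / threads_per_block;
}

// Unique index of a thread across the whole grid; the same arithmetic as
// blockIdx.x * blockDim.x + threadIdx.x inside a kernel.
std::size_t global_index(ThreadCoord c, std::size_t threads_per_block) {
    assert(c.thread_idx < threads_per_block);
    return c.block_idx * threads_per_block + c.thread_idx;
}
```

For example, covering 1000 elements with 256 threads per block takes blocks_for(1000, 256) blocks, and thread 5 of block 2 handles element global_index(ThreadCoord{2, 5}, 256). Because the last block may extend past the end of the data, a real kernel would typically check its global index against the element count before touching memory.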

Each SM contains 8 CUDA cores (the figure for compute capability 1.x GPUs; later architectures have more cores per SM), and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. On that hardware you can assume that threads in any given warp execute in lock-step, but to synchronise across the warps of a block you need to use __syncthreads().
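The warp arithmetic in the paragraph above can be written out as a small host-side C++ sketch. It is a simplified model, not a description of any particular GPU's scheduler: it assumes one warp issues at a time and each core handles one thread per clock, which is how the answer arrives at 4 cycles for 8 cores. All function names here are hypothetical, and the warp and lane formulas assume a one-dimensional block.

```cpp
#include <cassert>

// Threads per warp, as stated in the text.
const unsigned kWarpSize = 32;

// Clock cycles needed to issue one instruction for a full warp under the
// simplified one-thread-per-core-per-cycle model (rounded up).
unsigned issue_cycles_per_warp(unsigned cores_per_sm) {
    assert(cores_per_sm != 0);
    return (kWarpSize + cores_per_sm - 1) / cores_per_sm;
}

// Number of warps a block of the given size is split into; a partially
// filled last warp still counts as a whole warp.
unsigned warps_per_block(unsigned threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Warp that a thread of a 1-D block belongs to, and its lane inside it.
unsigned warp_of(unsigned thread_idx) { return thread_idx / kWarpSize; }
unsigned lane_of(unsigned thread_idx) { return thread_idx % kWarpSize; }
```

With 8 cores per SM, issue_cycles_per_warp(8) gives the 4 cycles quoted above. A 256-thread block splits into warps_per_block(256) warps, and threads whose warp_of values differ belong to different warps, which is the case where __syncthreads() is needed to synchronise them.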
