CUDA - Blocks and Threads


Problem Description



    I have implemented a string matching algorithm on the GPU. The searching time of the parallel version has decreased considerably compared with the sequential version of the algorithm, but using different numbers of blocks and threads gives me different results. How can I determine the number of blocks and threads to get the best results?

    Solution

    @Chris's points are very important too, but depend more on the algorithm itself.

    1. Check the CUDA manual about thread alignment for memory lookups. Shared memory arrays should also be a multiple of 16 in size.

    2. Use coalesced global memory reads. This is often already the case by algorithm design, and using shared memory helps.

    3. Don't use atomic operations in global memory, or at all if possible, because they are very slow. Some algorithms that use atomic operations can be rewritten with different techniques.

    Without the code shown, no one can tell you what is best or why performance changes.

    The number of threads per block of your kernel is the most important value.

    Important values to calculate that value are:

    • Maximum number of resident threads per multiprocessor
    • Maximum number of resident blocks per multiprocessor
    • Maximum number of threads per block
    • Number of 32-bit registers per multiprocessor

    Your algorithm should be scalable across all GPUs, reaching 100% occupancy. For this I created a helper class which automatically detects the best thread count for the GPU in use and passes it to the kernel as a DEFINE.

    /**
     * Number of Threads in a Block
     *
     * Maximum number of resident blocks per multiprocessor : 8
     *
     * ///////////////////
     * Compute capability:
     * ///////////////////
     *
     * Cuda [1.0 - 1.1] =   
     *  Maximum number of resident threads per multiprocessor 768
     *  Optimal Usage: 768 / 8 = 96
     * Cuda [1.2 - 1.3] =
     *  Maximum number of resident threads per multiprocessor 1024
     *  Optimal Usage: 1024 / 8 = 128
     * Cuda [2.x] =
     *  Maximum number of resident threads per multiprocessor 1536
     *  Optimal Usage: 1536 / 8 = 192
     */ 
    public static int BLOCK_SIZE_DEF = 96;
    

    Example: Cuda 1.1, to reach 768 resident threads per SM

    • 8 Blocks * 96 Threads per Block = 768 threads
    • 3 Blocks * 256 Threads per Block = 768 threads
    • 1 Block * 512 Threads per Block = 512 threads <- 33% of the GPU will be idle

    This is also mentioned in the book:

    Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)

    Good programming advice:

    1. Analyse your kernel code and write down the maximum number of threads it can handle or how many "units" it can process.
    2. Also output your register usage and try to lower it to fit the respective target CUDA version, because if you use too many registers in your kernel, fewer blocks will be executed, resulting in lower occupancy and performance.
      Example: Using Cuda 1.1 with the optimal number of 768 resident threads per SM, you have 8192 registers to use. This leads to 8192 / 768 = 10 maximum registers per thread/kernel. If you use 11, the GPU will run one block less, resulting in decreased performance.

    Example: A matrix-independent row-vector normalizing kernel of mine.

    /*
     * ////////////////////////
     * // Compute capability //
     * ////////////////////////
     *
     * Used 12 registers, 540+16 bytes smem, 36 bytes cmem[1]
     * Used 10 registers, 540+16 bytes smem, 36 bytes cmem[1] <-- with -maxrregcount=10, the limit for Cuda 1.1
     * I:   Maximum number of Rows = max(x-dim)^max(dimGrid)
     * II:  Maximum number of Columns = unlimited, since they are loaded in a tile loop
     *
     * Cuda [1.0 - 1.3]:
     * I:   65535^2 = 4,294,836,225
     *
     * Cuda [2.0]:
     * I:   65535^3 = 281,462,092,005,375
     */
    
