How to adjust the number of CUDA blocks and threads to get optimal performance

Question

I've tested empirically with several values of blocks and threads, and the execution time can be greatly reduced with specific values.

I don't see what the differences between blocks and threads are. I figure that threads in a block may have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated across blocks/threads.

My goal is to automatically adjust the number of blocks and threads with regard to the size of the memory I have to use. Could that be possible? Thank you.

Answer

Hong Zhou's answer is good, so far. Here are some more details:

When using shared memory you might want to consider it first, because it's a very limited resource, and it's not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
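To make that tradeoff concrete, here is a minimal sketch (the kernel and the per-thread tile size are hypothetical, meant only to show how the shared region scales with the block's thread count):

#define TILE 8

// Each thread stages TILE elements in shared memory, so a block of
// blockDim.x threads consumes blockDim.x * TILE * sizeof(float) bytes
// of the multiprocessor's shared memory budget: more threads per block
// means one larger shared region, fewer threads means a smaller one.
__global__ void staged_sum(const float *in, float *out, int n_out)
{
    extern __shared__ float tile[];  // sized at launch time
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float *my = &tile[threadIdx.x * TILE];

    if (gid < n_out) {
        for (int i = 0; i < TILE; ++i)  // stage a private slice
            my[i] = in[gid * TILE + i];
        float s = 0.0f;
        for (int i = 0; i < TILE; ++i)
            s += my[i];
        out[gid] = s;
    }
}

// Launch with the shared region sized to match the thread count:
// staged_sum<<<blocks, threads, threads * TILE * sizeof(float)>>>(in, out, n_out);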

If your code can live with as little as 16KB of shared memory per multiprocessor, you might want to opt for the larger (48KB) L1 cache by calling:

cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
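Note that the same preference can also be set per kernel rather than device-wide, via cudaFuncSetCacheConfig (the kernel name below is a placeholder):

// Prefer the larger shared-memory split for one particular kernel,
// overriding the device-wide preference set above:
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);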

Further, L1 caching can be disabled for non-local global accesses using the compiler option -Xptxas=-dlcm=cg, to avoid pollution when the kernel accesses global memory carefully.
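For reference, a typical compile invocation with that option might look like this (file names are placeholders):

nvcc -Xptxas=-dlcm=cg -o app kernel.cu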

Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).

Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:

The higher the occupancy (warps per multiprocessor), the less likely the multiprocessor will have to wait (for memory transactions or data dependencies), but the more threads must share the same L1 cache, shared memory area and register file (see the CUDA Optimization Guide and also this presentation).

The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs: register values get temporarily stored on the (relatively slow, off-chip) local memory stack.

Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.

It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial, and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
