Why launch a multiple of 32 threads in CUDA?


Question

I took a course in CUDA parallel programming and saw many examples of CUDA thread configuration in which the number of threads needed is rounded up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads, the GPU will round that up to 1024 anyway. So why do it explicitly?
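The rounding mentioned in the question can be checked with a few lines of host-side arithmetic. This is only an illustrative sketch in Python, not CUDA code: the helper names `warps_per_block` and `lanes_per_block` are hypothetical, and the warp size of 32 is assumed (it holds for current NVIDIA GPUs).

```python
WARP_SIZE = 32  # assumed warp width; true for current NVIDIA GPUs


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to hold one block (ceiling division by the warp size)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def lanes_per_block(threads_per_block: int) -> int:
    """Thread slots the hardware schedules for one block, in whole warps."""
    return warps_per_block(threads_per_block) * WARP_SIZE


# A 1000-thread block occupies 32 warps, i.e. 1024 thread slots.
print(warps_per_block(1000), lanes_per_block(1000))  # prints: 32 1024
```

A block of 1000 threads and a block of 1024 threads therefore occupy the same number of warps; the difference is whether the last 24 slots can do useful work.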

Recommended answer

This advice is generally given for situations where you could plausibly choose any of several threadblock sizes to solve the same problem.

Let's take vector add as an example. Suppose my vector is of length 100000. I might choose to do this by launching 100 blocks of 1000 threads each. In this case, each block occupies 32 warps (1024 thread slots), so it has 1000 active threads and 24 inactive threads. My average utilization of thread resources is 1000/1024, or approximately 97.7%.

Now, what if I chose blocks of size 1024? Now I only need to launch 98 blocks. The first 97 of these blocks are fully utilized in terms of thread utilization: every thread is doing something useful. The 98th block has only 672 (out of 1024) threads that are doing something useful. The others are explicitly inactive because of a thread check (if (idx < N)) or other construct in the kernel code. So I have 352 inactive threads in that one block. But my overall average utilization is 100000/100352 = 99.6%.
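The two configurations above can be compared with a short calculation. The following is a minimal Python sketch of the answer's arithmetic, not a measurement on real hardware. The function `launch_utilization` is a hypothetical helper, a warp size of 32 is assumed, and "utilization" here means useful threads divided by all scheduled thread slots, counting both padding in partial warps and threads masked off by the bounds check as inactive.

```python
WARP_SIZE = 32  # assumed warp width


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


def launch_utilization(n: int, block_size: int):
    """Return (blocks, scheduled_slots, utilization) for n elements,
    one element per thread, with blocks of block_size threads."""
    blocks = ceil_div(n, block_size)
    # Each block is scheduled in whole warps, so round its size up to a multiple of 32.
    slots_per_block = ceil_div(block_size, WARP_SIZE) * WARP_SIZE
    scheduled_slots = blocks * slots_per_block
    return blocks, scheduled_slots, n / scheduled_slots


for block_size in (1000, 1024):
    blocks, slots, util = launch_utilization(100_000, block_size)
    print(f"block size {block_size}: {blocks} blocks, {slots} slots, {util:.2%}")
# prints:
# block size 1000: 100 blocks, 102400 slots, 97.66%
# block size 1024: 98 blocks, 100352 slots, 99.65%
```

With blocks of 1000, every block wastes 24 slots; with blocks of 1024, the only waste is the 352 guarded-off threads in the final block.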

So in the above scenario, it's better to choose "full" threadblocks, with a size evenly divisible by 32.

If you are doing vector add on a vector of length 1000, and you intend to do it in a single threadblock (both may be bad ideas), then it does not matter whether you choose 1000 or 1024 for your threadblock size.
