Should I check the number of threads in kernel code?

Question

I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:

__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x*blockIdx.y*gridDim.x  // threads in the grid rows preceding the current row
            + blockDim.x*blockIdx.x             // threads in the blocks preceding the current block in this row
            + threadIdx.x;

    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}

They think there are situations where CUDA might launch more threads than specified, for the sake of alignment or warp size, so we need to check the thread index every time. However, I have not yet seen an example kernel on the internet that actually performs this check.

Can CUDA actually launch more threads than specified block/grid dimensions?

Recommended answer

CUDA will not launch more threads than are specified by the block/grid dimensions.

However, due to the granularity of block dimensions (e.g. it is desirable for the block size to be a multiple of 32, and a block is limited to 1024 threads, or 512 on older GPUs), it is often difficult to launch a grid whose total thread count exactly equals the desired problem size.

In these cases, the typical approach is to launch more threads than the problem needs, effectively rounding the thread count up to the next multiple of the block size, and to use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
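The rounding-up step is usually a ceiling division on the host side. The following is a minimal sketch of that arithmetic; the helper names `blocksFor` and `launchedThreads` are hypothetical and do not come from the original answer or from the CUDA API.

```cuda
// Hypothetical helper (not part of the CUDA API): number of blocks needed so
// that blocks * threadsPerBlock >= problemSize, i.e. a ceiling division.
inline int blocksFor(int problemSize, int threadsPerBlock)
{
    return (problemSize + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: total number of threads the launch would create.
// This can exceed problemSize, which is why the kernel needs a thread check.
inline int launchedThreads(int problemSize, int threadsPerBlock)
{
    return blocksFor(problemSize, threadsPerBlock) * threadsPerBlock;
}
```

When the problem size is already a multiple of the block size, the two values coincide and no extra threads are launched, but keeping the check in the kernel means the same code stays correct for any size.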

In your example, this could be clarified by writing:

__global__ void myKernel(int problem_size)
    ...
    if (threadId < problem_size)

which communicates what is intended, that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.

As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.

If I want the size of each threadblock to be a multiple of 32, there is no number of threadblocks I can choose that will give me exactly 10000 threads. Therefore, I might choose 256 threads per threadblock and launch 40 threadblocks, giving me 10240 threads in total. Using the thread check, I prevent the "extra" 240 threads from doing anything.
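The vector add described above might look roughly like the sketch below. The kernel name `vectorAdd`, the host wrapper `vectorAddMismatches`, and the test data are all assumptions made for illustration; the original answer does not show this code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds one element. Threads whose index is >= n are the
// "extra" threads created by rounding up, and they do nothing.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)  // the thread check
    {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical host wrapper: runs the add on n elements and returns the
// number of wrong results, or -1 if any CUDA call reported an error.
int vectorAddMismatches(int n, int threadsPerBlock)
{
    std::vector<float> hA(n), hB(n), hC(n, 0.0f);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f * i; hB[i] = 2.0f * i; }

    const size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
    {
        // Round up: for n = 10000 and 256 threads per block this gives 40 blocks.
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
        err = cudaGetLastError();
    }
    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);  // freeing a null pointer is a no-op
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
    {
        if (hC[i] != 3.0f * i) ++mismatches;
    }
    return mismatches;
}
```

Calling `vectorAddMismatches(10000, 256)` launches 40 blocks of 256 threads. Without the `idx < n` check, the last 240 threads would read and write past the end of the 10000-element buffers.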