Is starting 1 thread per element always optimal for data independent problems on the GPU?


Question


I was writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it to cudaMemcpy(). It looks like this:

template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    //This copy kernel is only correct when size%sizeof(vector_type)==0
    auto numElements = size / sizeof(vector_type);

    for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
        dest2[id] = src2[id];
    }
}
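
For reference, here is a minimal sketch of how a kernel like this might be launched and timed with CUDA events to produce bandwidth figures like the ones below. The buffer names, the decimal 400 mb size and the event-based timing are my own assumptions for illustration, not code taken from the question:

#include <cstdio>
#include <cuda_runtime.h>

// (the `copy` kernel template from above is assumed to be defined in this file)

int main() {
    const unsigned int SIZE = 400u * 1000u * 1000u;   // 400 mb, a multiple of sizeof(int2)

    void *d_src, *d_dest;
    cudaMalloc(&d_src, SIZE);
    cudaMalloc(&d_dest, SIZE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // 32 blocks: the "100% occupancy" grid; swap in 195313 blocks
    // for the one-element-per-thread variant
    copy<256><<<32, 256>>>(d_src, d_dest, SIZE);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // factor 2: every byte is read once and written once
    printf("finished in %fms. Bandwidth: %f GB/s\n", ms, 2.0 * SIZE / (ms * 1e6));

    cudaFree(d_src);
    cudaFree(d_dest);
    return 0;
}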


I also calculated the number of blocks required to reach 100% occupancy like so:

THREADS_PER_BLOCK = 256 
Multi-Processors: 4 
Max Threads per Multi Processor: 2048 
NUM_BLOCKS = 4 * 2048 / 256 = 32
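
The same grid size can also be derived at runtime from the device properties instead of hard-coding the 4 and 2048. A minimal sketch of how that query might look (my own illustration, assuming the figures above came from the device properties):

#include <cuda_runtime.h>

constexpr unsigned int THREADS_PER_BLOCK = 256;

// Blocks needed so that every SM carries its maximum resident thread count.
unsigned int occupancyBlocks(int device = 0) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // GTX 760M: 4 SMs * 2048 threads per SM / 256 threads per block = 32 blocks
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor / THREADS_PER_BLOCK;
}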


My tests, on the other hand, showed that starting enough blocks so that each thread only processes one element always outperformed the "optimal" block count. Here are the timings for 400 MB of data:

bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
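
(These figures are consistent with counting both the read and the write traffic, with 1 mb = 10^6 bytes: 2 x 400 x 10^6 bytes / 15.63 ms ≈ 51.2 GB/s. The 195313-block grid likewise follows from 400 x 10^6 / sizeof(int2) = 50,000,000 elements, and 50,000,000 / 256 threads per block, rounded up, gives 195,313 blocks, i.e. exactly one element per thread.)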

So my questions are:

Why is there a speed difference?


Are there any downsides to starting one thread per element, when each element can be processed completely independently of all other elements?

Answer


Is starting 1 thread per element always optimal for data independent problems on the GPU?


Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves one element load, some "work" and one element store per thread. In your copy example there is basically no work - just loads and stores.


  1. One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.


  2. Multiple elements per thread, not known at compile time. In this case, we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store per each loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1, except that we have some additional overhead associated with processing the loop.


  3. Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the parallelforall blog matrix transpose example, the writer of that (essentially a copy) kernel chose to have each thread perform 8 elements of copy "work". The compiler then sees a loop:

  LOOP:  LD R0, in[idx];
         ST out[idx], R0;
         ...
         BRA  LOOP;


which it can unroll (let's say) 8 times:

     LD R0, in[idx];
     ST out[idx], R0;
     LD R1, in[idx+1];
     ST out[idx+1], R1;
     LD R2, in[idx+2];
     ST out[idx+2], R2;
     LD R3, in[idx+3];
     ST out[idx+3], R3;
     LD R4, in[idx+4];
     ST out[idx+4], R4;
     LD R5, in[idx+5];
     ST out[idx+5], R5;
     LD R6, in[idx+6];
     ST out[idx+6], R6;
     LD R7, in[idx+7];
     ST out[idx+7], R7;


and after that it can reorder the instructions, since the operations are independent:

     LD R0, in[idx];
     LD R1, in[idx+1];
     LD R2, in[idx+2];
     LD R3, in[idx+3];
     LD R4, in[idx+4];
     LD R5, in[idx+5];
     LD R6, in[idx+6];
     LD R7, in[idx+7];
     ST out[idx], R0;
     ST out[idx+1], R1;
     ST out[idx+2], R2;
     ST out[idx+3], R3;
     ST out[idx+4], R4;
     ST out[idx+5], R5;
     ST out[idx+6], R6;
     ST out[idx+7], R7;


at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.


What we have done here is effectively (working in conjunction with the compiler) increase the number of instructions that can be issued per thread before the thread hits a stall. This is also referred to as increasing the exposed parallelism, in this approach basically via ILP. Whether or not it has any benefit will vary depending on your actual code, your actual GPU, and what else is going on in the GPU at that time. But it is always a good strategy to increase exposed parallelism using techniques such as this, because the ability to issue instructions is how the GPU hides the various forms of latency that it must deal with, so with this approach we have effectively improved the GPU's ability to hide latency.
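
At the CUDA source level, case 3 corresponds to a copy kernel whose per-thread element count is a compile-time constant, so the compiler is free to unroll and reorder as sketched above. The following is my own minimal illustration of that pattern, not the kernel from the parallelforall blog, and it assumes the buffer size is an exact multiple of the bytes covered by one block (no tail handling):

template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMENTS_PER_THREAD>
__global__ static
void copy_unrolled(void* src, void* dest) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    // Each block covers THREADS_PER_BLOCK * ELEMENTS_PER_THREAD consecutive elements;
    // the caller must size the grid to cover the whole buffer.
    auto base = blockIdx.x * THREADS_PER_BLOCK * ELEMENTS_PER_THREAD + threadIdx.x;

    #pragma unroll
    for (unsigned int i = 0; i < ELEMENTS_PER_THREAD; ++i) {
        // Stride by THREADS_PER_BLOCK so each warp's accesses stay coalesced.
        // The trip count is a compile-time constant, so the compiler can emit
        // all ELEMENTS_PER_THREAD loads before the first dependent store.
        dest2[base + i * THREADS_PER_BLOCK] = src2[base + i * THREADS_PER_BLOCK];
    }
}

// Usage sketch, assuming size is a multiple of sizeof(int2) * 256 * 8:
//   auto numElements = size / sizeof(int2);
//   copy_unrolled<256, 8><<<numElements / (256 * 8), 256>>>(src, dest);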

Why is there a speed difference?


This can be difficult to answer without profiling the code carefully. However it's often the case that launching just enough threads to fully satisfy the instantaneous carrying capacity of the GPU is not a good strategy, possibly due to the "tail effect" or other types of inefficiency. It may also be the case that blocks are limited by some other factor, such as registers or shared memory usage. It's usually necessary to carefully profile as well as possibly study the generated machine code to fully answer such a question. But it may be that the loop overhead measurably impacts your comparison, which is basically my case 2 vs. my case 1 above.


(Note that the memory indices in my "pseudo" machine code example are not what you would expect for a well-written grid-striding copy loop; they are just for demonstration purposes, to show unrolling and the benefit it can have via compiler instruction reordering.)

