Is starting 1 thread per element always optimal for data independent problems on the GPU?


Problem Description

I was writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it to cudaMemcpy(). It looks like this:

template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    //This copy kernel is only correct when size%sizeof(vector_type)==0
    auto numElements = size / sizeof(vector_type);

    for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
        dest2[id] = src2[id];
    }
}

I also calculated the number of blocks required to reach 100% occupancy like so:

THREADS_PER_BLOCK = 256 
Multi-Processors: 4 
Max Threads per Multi Processor: 2048 
NUM_BLOCKS = 4 * 2048 / 256 = 32
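
For reference, those numbers can also be looked up at runtime rather than hard-coded. Below is a minimal standalone sketch of how that lookup might be done (my own illustration, not part of the original test; it assumes it is compiled in the same .cu file as the copy kernel above, and error checking is omitted). The helper name blocksForFullOccupancy is hypothetical, and the cudaOccupancyMaxActiveBlocksPerMultiprocessor call is just an optional sanity check that register or shared memory usage does not cap residency below the theoretical maximum:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: derive the "100% occupancy" grid size from the device
// properties instead of hard-coding 4 * 2048 / 256 = 32.
int blocksForFullOccupancy(int threadsPerBlock) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                          // device 0 assumed
    int maxResidentThreads = prop.multiProcessorCount           // 4 SMs on a GTX 760M
                           * prop.maxThreadsPerMultiProcessor;  // 2048 threads per SM
    return maxResidentThreads / threadsPerBlock;                // 8192 / 256 = 32
}

int main() {
    constexpr unsigned int THREADS_PER_BLOCK = 256;
    printf("NUM_BLOCKS = %d\n", blocksForFullOccupancy(THREADS_PER_BLOCK));

    // Optional cross-check: how many 256-thread blocks of the copy kernel can
    // actually be resident per SM? Register or shared memory usage could make
    // this smaller than the theoretical 2048 / 256 = 8.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, copy<THREADS_PER_BLOCK>, THREADS_PER_BLOCK, 0);
    printf("Resident blocks per SM for copy<256>: %d\n", blocksPerSM);
    return 0;
}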

My tests, on the other hand, showed that starting enough blocks so that each thread only processes one element always outperformed the "optimal" block count. Here are the timings for 400 MB of data:

bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
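
The question does not show the timing harness itself, but a comparison like this is typically measured with CUDA events. The following standalone sketch is my own reconstruction, not the original code: it assumes it is compiled together with the copy kernel above, the name timeCopy and the 400 * 1000 * 1000 byte size are assumptions (that size reproduces the 195313-block figure), and warm-up runs and error checking are omitted:

#include <cstdio>
#include <cuda_runtime.h>

constexpr unsigned int THREADS = 256;
constexpr unsigned int SIZE = 400u * 1000u * 1000u;  // 400 MB, matching the 195313-block figure

// Hypothetical harness: time one launch of the copy kernel above with CUDA
// events and print the effective bandwidth (each byte is read once and
// written once, hence the factor of 2).
void timeCopy(void* src, void* dest, unsigned int numBlocks) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<THREADS><<<numBlocks, THREADS>>>(src, dest, SIZE);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = 2.0 * SIZE / (ms * 1e-3) / 1e9;
    printf("my memcpy (%u blocks) finished in %fms. Bandwidth: %f GB/s\n",
           numBlocks, ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main() {
    void *src, *dest;
    cudaMalloc(&src, SIZE);
    cudaMalloc(&dest, SIZE);

    unsigned int numElements  = SIZE / sizeof(int2);                   // 50,000,000 int2 elements
    unsigned int onePerThread = (numElements + THREADS - 1) / THREADS; // 195313 blocks
    timeCopy(src, dest, onePerThread);  // one element per thread
    timeCopy(src, dest, 32);            // "100% occupancy" grid from above

    cudaFree(src);
    cudaFree(dest);
    return 0;
}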

So my questions are:

Why is there a speed difference?

Are there any downsides to starting one thread per element, when each element can be processed completely independently of all other elements?

Answer


Is starting 1 thread per element always optimal for data independent problems on the GPU?

Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves one element load, some "work" and one element store per thread. In your copy example there is basically no work - just loads and stores.


  1. One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.

  2. Multiple elements per thread, not known at compile time. In this case, we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store per loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1, except that we have some additional overhead associated with processing the loop.

  3. Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the parallelforall blog matrix transpose example, the writer of that (essentially a copy) kernel chose to have each thread perform 8 elements of copy "work" (see the CUDA-level sketch after this discussion). The compiler then sees a loop:

  LOOP:  LD R0, in[idx];
         ST out[idx], R0;
         ...
         BRA  LOOP;

which it can unroll (let's say) 8 times:

     LD R0, in[idx];
     ST out[idx], R0;
     LD R0, in[idx+1];
     ST out[idx+1], R0;
     LD R0, in[idx+2];
     ST out[idx+2], R0;
     LD R0, in[idx+3];
     ST out[idx+3], R0;
     LD R0, in[idx+4];
     ST out[idx+4], R0;
     LD R0, in[idx+5];
     ST out[idx+5], R0;
     LD R0, in[idx+6];
     ST out[idx+6], R0;
     LD R0, in[idx+7];
     ST out[idx+7], R0;

and after that it can reorder the instructions, since the operations are independent:

     LD R0, in[idx];
     LD R1, in[idx+1];
     LD R2, in[idx+2];
     LD R3, in[idx+3];
     LD R4, in[idx+4];
     LD R5, in[idx+5];
     LD R6, in[idx+6];
     LD R7, in[idx+7];
     ST out[idx], R0;
     ST out[idx+1], R1;
     ST out[idx+2], R2;
     ST out[idx+3], R3;
     ST out[idx+4], R4;
     ST out[idx+5], R5;
     ST out[idx+6], R6;
     ST out[idx+7], R7;

at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.

What we have done here is effectively (working in conjunction with the compiler) increase the number of instructions that can be issued per thread, before the thread will hit a stall. This is also referred to as increasing the exposed parallelism, basically via ILP in this approach. Whether or not it has any benefit will vary depending on your actual code, your actual GPU, and what else is going on in the GPU at that time. But it is always a good strategy to increase exposed parallelism using techniques such as this, because the ability to issue instructions is how the GPU hides the various forms of latency that it must deal with, so we have effectively improved the GPU's ability to hide latency with this approach.
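
To make case 3 concrete at the CUDA C++ source level, here is a minimal sketch (my own illustration, not code from the question or from the parallelforall post) of a copy kernel with a compile-time ELEMENTS_PER_THREAD count. Because the inner trip count is a template constant, the compiler can fully unroll it and then schedule the independent loads ahead of the stores, as in the reordered pseudo machine code above. The kernel name and the tiling scheme are assumptions, and like the original kernel it is only correct under a divisibility restriction on size:

template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMENTS_PER_THREAD>
__global__ static
void copy_unrolled(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2  = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    // Only correct when size is a multiple of
    // THREADS_PER_BLOCK * ELEMENTS_PER_THREAD * sizeof(vector_type).
    auto numElements = size / sizeof(vector_type);

    // Each block owns a contiguous tile of THREADS_PER_BLOCK * ELEMENTS_PER_THREAD
    // elements; within the tile, consecutive threads touch consecutive addresses,
    // so every access stays coalesced.
    constexpr unsigned int TILE = THREADS_PER_BLOCK * ELEMENTS_PER_THREAD;
    for (auto base = TILE * blockIdx.x + threadIdx.x; base < numElements;
         base += gridDim.x * TILE) {
        // Compile-time trip count: the compiler can fully unroll this loop and
        // then issue the ELEMENTS_PER_THREAD independent loads before the first
        // dependent store, as in case 3 above.
        #pragma unroll
        for (unsigned int k = 0; k < ELEMENTS_PER_THREAD; ++k) {
            dest2[base + k * THREADS_PER_BLOCK] = src2[base + k * THREADS_PER_BLOCK];
        }
    }
}

// Hypothetical launch: copy_unrolled<256, 8><<<numElements / (256 * 8), 256>>>(src, dest, size);

With ELEMENTS_PER_THREAD = 8 and 256-thread blocks, each block covers 2048 elements, so the same copy needs roughly one eighth as many blocks (about 24,000 for the 400 MB case, modulo the divisibility restriction), while each thread now exposes 8 independent loads. Whether that actually wins on a GTX 760M is exactly the kind of thing the answer suggests measuring rather than assuming.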


Why is there a speed difference?

This can be difficult to answer without profiling the code carefully. However it's often the case that launching just enough threads to fully satisfy the instantaneous carrying capacity of the GPU is not a good strategy, possibly due to the "tail effect" or other types of inefficiency. It may also be the case that blocks are limited by some other factor, such as registers or shared memory usage. It's usually necessary to carefully profile as well as possibly study the generated machine code to fully answer such a question. But it may be that the loop overhead measurably impacts your comparison, which is basically my case 2 vs. my case 1 above.

(note the memory indices in my "pseudo" machine code example are not what you would expect for a well written grid-striding copy loop - they are just for example purposes to demonstrate unrolling and the benefit it can have via compiler instruction reordering).
