CUDA: How Does Kernel Fusion Improve Performance on Memory-Bound Applications on the GPU?


Problem Description



I've been researching how to stream datasets that are larger than the available GPU memory to the device for basic computations. One of the main limitations is that the PCIe bus is generally limited to around 8 GB/s, and kernel fusion can help by reusing data that would otherwise be reloaded and by exploiting shared memory and locality within the GPU. Most research papers I have found are very difficult to understand, and most of them implement fusion in complex applications, such as https://ieeexplore.ieee.org/document/6270615. I've read many papers, and they all fail to explain the simple steps needed to fuse two kernels together.

My question is: how does fusion actually work? What are the steps one would go through to change a normal kernel into a fused kernel? Also, is it necessary to have more than one kernel in order to fuse, or is fusing just a fancy term for eliminating some memory-bound issues and exploiting locality and shared memory?

I need to understand how kernel fusion is used in a basic CUDA program, such as matrix multiplication or addition and subtraction kernels. A really simple example (the code is not correct, but it should give the idea):

int *device_A;
int *device_B;
int *device_C;

cudaMalloc(&device_A, sizeof(int)*N);

cudaMemcpyAsync(device_A, host_A, N*sizeof(int), cudaMemcpyHostToDevice, stream);

KernelAdd<<<blocks, threads, 0, stream>>>(device_A, device_B, device_C); // put result in C
KernelSubtract<<<blocks, threads, 0, stream>>>(device_C);

cudaMemcpyAsync(host_C, device_C, N*sizeof(int), cudaMemcpyDeviceToHost, stream); // send final result over PCIe to the CPU

Solution

The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:

  1. by reusing the data that a kernel may have populated either in registers or shared memory
  2. by reducing (i.e. eliminating) "redundant" loads and stores

Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:

  1. Load my element of vector A from global memory
  2. Add a constant to, or multiply by a constant, my vector element
  3. Store my element back out to vector A (in global memory)

This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:

Add kernel:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Store my element back out to vector A (in global memory)

Multiply kernel:

  1. Load my element of vector A from global memory
  2. Multiply my vector element by a value
  3. Store my element back out to vector A (in global memory)
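
In code, a minimal sketch of those two separate kernels might look like this (assuming float vectors and the hypothetical names addKernel and multiplyKernel; each thread handles one element and updates A in place, exactly as in the steps above):

__global__ void addKernel(float *A, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        A[i] = A[i] + b;    // load A[i] from global memory, add, store it back
}

__global__ void multiplyKernel(float *A, float c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        A[i] = A[i] * c;    // load A[i] from global memory again, multiply, store it back again
}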

We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.

But if we combine the two kernels together, we could write a kernel like this:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Multiply my vector element by a value
  4. Store my element back out to vector A (in global memory)

This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
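
Continuing the sketch above (same hypothetical names and float vectors), the fused version could be:

__global__ void fusedAddMultiplyKernel(float *A, float b, float c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        A[i] = (A[i] + b) * c;   // one load, add and multiply while the value sits in a register, one store
}

On the host side, the two kernel launches from the question then collapse into a single launch of the fused kernel; the intermediate (added) value never goes back to global memory, it stays in a register.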

For memory-bound operations like these on the GPU, the savings can be very significant. By reducing the number of loads and stores required, overall performance improves, usually roughly in proportion to the reduction in the number of load/store operations.
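
To put rough numbers on this example: with 4-byte elements and N elements per vector, the two separate kernels generate about 4*4*N bytes of global memory traffic (two loads plus two stores), while the fused kernel generates about 2*4*N bytes, so a memory-bound kernel pair like this could be expected to run up to roughly twice as fast after fusion.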
