CUDA How Does Kernel Fusion Improve Performance on Memory Bound Applications on the GPU?

Question

I've been conducting research on streaming datasets larger than the memory available on the GPU, streaming them to the device for basic computations. One of the main limitations is that the PCIe bus is generally limited to around 8 GB/s. Kernel fusion can help by reusing data where possible and by exploiting shared memory and locality within the GPU. Most research papers I have found are very difficult to understand, and most of them implement fusion in complex applications, such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they ALL FAIL TO EXPLAIN some simple steps for fusing two kernels together.

My question is: how does fusion actually work? What are the steps one would go through to change a normal kernel into a fused kernel? Also, is it necessary to have more than one kernel in order to fuse, or is "fusion" just a fancy term for eliminating some memory-bound issues and exploiting locality and shared memory?

I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (the code is not correct, but it should give the idea) like:

int *device_A;
int *device_B;
int *device_C;

cudaMalloc((void**)&device_A, N*sizeof(int));
cudaMalloc((void**)&device_B, N*sizeof(int));
cudaMalloc((void**)&device_C, N*sizeof(int));

cudaMemcpyAsync(device_A, host_A, N*sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(device_B, host_B, N*sizeof(int), cudaMemcpyHostToDevice, stream);

KernelAdd<<<block,thread,0,stream>>>(device_A, device_B, device_C); // put result in C
KernelSubtract<<<block,thread,0,stream>>>(device_C);

cudaMemcpyAsync(host_C, device_C, N*sizeof(int), cudaMemcpyDeviceToHost, stream); // send final result through the PCIe to the CPU

Answer

The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:

  1. By reusing data that a kernel may have populated in registers or shared memory
  2. By reducing (i.e. eliminating) "redundant" loads and stores

Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:

  1. Load my element of vector A from global memory
  2. Add a constant to (or multiply a constant by) my vector element
  3. Store my element back to vector A (in global memory)

This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:

Add kernel:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Store my element back to vector A (in global memory)

Multiply kernel:

  1. Load my element of vector A from global memory
  2. Multiply my vector element by a value
  3. Store my element back to vector A (in global memory)

We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.

But if we combine the two kernels together, we could write a kernel like this:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Multiply my vector element by a value
  4. Store my element back to vector A (in global memory)

This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.

These savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, overall performance improves, usually in proportion to the reduction in the number of load/store operations.
