CUDA profiler reports inefficient global memory access

Problem Description

I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses. My kernel code is:

__global__ void update_particles_kernel
(
    float4 *pos, 
    float4 *vel, 
    float4 *acc, 
    float dt, 
    int numParticles
)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = 0;

    while(index + offset < numParticles)
    {
        vel[index + offset].x += dt*acc[index + offset].x;   // line 247
        vel[index + offset].y += dt*acc[index + offset].y;
        vel[index + offset].z += dt*acc[index + offset].z;

        pos[index + offset].x += dt*vel[index + offset].x;   // line 251
        pos[index + offset].y += dt*vel[index + offset].y;
        pos[index + offset].z += dt*vel[index + offset].z;

        offset += blockDim.x * gridDim.x;
    }
}

In particular the profiler reports the following:

From the CUDA best practices guide it says:

"For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which has 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter, 32-byte segments (see the CUDA C Programming Guide).

For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 allow opt-in caching of globals in L1 as well."

Now in my kernel, based on this information, I would expect that 16 accesses would be required to service a 32-thread warp, because a float4 is 16 bytes and on my card (770m, compute capability 3.0) reads from the L2 cache are performed in 32-byte chunks (16 bytes * 32 threads / 32-byte cache lines = 16 accesses). Indeed, as you can see, the profiler reports that I am doing 16 accesses. What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247 and only 4 L2 transactions per access for the remaining lines. Can someone explain what I am missing here?

Recommended Answer

I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses.

To take one example, your float4 vel array is stored in memory like this:

0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
  ^               ^               ^               ^             ...
  thread0         thread1         thread2         thread3

So when you do this:

vel[index + offset].x += ...;   // line 247

you are accessing (storing to) the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which the profiler is pointing out. (It does not matter that in the very next line of code, you are storing to the .y locations.)

There are at least 2 solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means, and how to do it, so I will let you look that up.
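To give a flavor of what that could look like, here is a minimal sketch of an SoA variant of the kernel. The per-component arrays (pos_x, vel_x, acc_x, and so on) and the kernel name are illustrative assumptions, not part of the original code:

__global__ void update_particles_soa_kernel
(
    float *pos_x, float *pos_y, float *pos_z,   // hypothetical SoA arrays:
    float *vel_x, float *vel_y, float *vel_z,   // one contiguous array per component
    float *acc_x, float *acc_y, float *acc_z,
    float dt,
    int numParticles
)
{
    int index  = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    // Consecutive threads now read/write consecutive floats in each array,
    // so every warp-wide access coalesces into contiguous 32-byte L2 segments.
    for (int i = index; i < numParticles; i += stride)
    {
        vel_x[i] += dt * acc_x[i];
        vel_y[i] += dt * acc_y[i];
        vel_z[i] += dt * acc_z[i];

        pos_x[i] += dt * vel_x[i];
        pos_y[i] += dt * vel_y[i];
        pos_z[i] += dt * vel_z[i];
    }
}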

The other typical solution is to load a float4 quantity per thread, when you need it, and store a float4 quantity per thread, when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:

//preceding code need not change
while(index + offset < numParticles)
{
    float4 my_vel = vel[index + offset];
    float4 my_acc = acc[index + offset];
    my_vel.x += dt*my_acc.x;   
    my_vel.y += dt*my_acc.y;
    my_vel.z += dt*my_acc.z;
    vel[index + offset] = my_vel;

    float4 my_pos = pos[index + offset];
    my_pos.x += dt*my_vel.x; 
    my_pos.y += dt*my_vel.y;
    my_pos.z += dt*my_vel.z;
    pos[index + offset] = my_pos;

    offset += blockDim.x * gridDim.x;
}
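For completeness, a host-side launch of a grid-stride kernel like this might look roughly as follows. The block size, grid size, and the d_pos/d_vel/d_acc device pointers are illustrative assumptions; because of the grid-stride loop, the exact grid size is not critical:

// Illustrative launch (assumed sizes and device pointers, not from the question).
int blockSize = 256;
int gridSize  = (numParticles + blockSize - 1) / blockSize;
update_particles_kernel<<<gridSize, blockSize>>>(d_pos, d_vel, d_acc, dt, numParticles);
cudaDeviceSynchronize();   // wait for the kernel, e.g. before timing or copying back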

Even though you might think that this code is "less efficient" than your code, because your code "appears" to be only loading and storing .x, .y, .z, whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads and stores to/from global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements, the GPU will load the .w elements from global memory, and also (eventually) store the .w elements back to global memory.

What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247

For line 247 in your original code, you are accessing one float quantity per thread for the load operation of acc.x, and one float quantity per thread for the load operation of vel.x. A float quantity per thread by itself should require 128 bytes for a warp, which is four 32-byte L2 cachelines. Two loads together would require 8 L2 cacheline loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).
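Spelling that arithmetic out explicitly (the constants below are just the warp size, sizeof(float), and the 32-byte L2 segment size assumed in the answer):

#include <stdio.h>

int main(void)
{
    const int warp_size        = 32;   // threads per warp
    const int bytes_per_float  = 4;    // sizeof(float)
    const int l2_segment_bytes = 32;   // assumed L2 transaction granularity

    int bytes_per_warp = warp_size * bytes_per_float;         // 128 bytes per warp-wide float load
    int txns_per_load  = bytes_per_warp / l2_segment_bytes;   // 4 L2 transactions per load
    int txns_line_247  = 2 * txns_per_load;                   // vel.x load + acc.x load = 8

    printf("ideal L2 transactions per float load: %d\n", txns_per_load);
    printf("ideal L2 transactions for line 247:   %d\n", txns_line_247);
    return 0;
}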
