CUDA profiler reports inefficient global memory access


Problem description


I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses. My kernel code is:

__global__ void update_particles_kernel
(
    float4 *pos, 
    float4 *vel, 
    float4 *acc, 
    float dt, 
    int numParticles
)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = 0;

    while(index + offset < numParticles)
    {
        vel[index + offset].x += dt*acc[index + offset].x;   // line 247
        vel[index + offset].y += dt*acc[index + offset].y;
        vel[index + offset].z += dt*acc[index + offset].z;

        pos[index + offset].x += dt*vel[index + offset].x;   // line 251
        pos[index + offset].y += dt*vel[index + offset].y;
        pos[index + offset].z += dt*vel[index + offset].z;

        offset += blockDim.x * gridDim.x;
    }
}


In particular the profiler reports the following:

From the CUDA Best Practices Guide:

"For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines required to service all of the threads of the warp. By default, all accesses are cached through L1, which has 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide).

For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 also allow opt-in caching of globals in L1."


Now, in my kernel, based on this information, I would expect that 16 accesses would be required to service a 32-thread warp, because a float4 is 16 bytes and on my card (770m, compute capability 3.0) reads from the L2 cache are performed in 32-byte chunks (16 bytes * 32 threads / 32-byte cache lines = 16 accesses). Indeed, as you can see, the profiler reports that I am performing 16 accesses. What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247, but only 4 L2 transactions per access for the remaining lines. Can someone explain what I am missing here?

Answer


I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses.


To take one example, your float4 vel array is stored in memory like this:

0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
  ^               ^               ^               ^             ...
  thread0         thread1         thread2         thread3

So when you do this:

vel[index + offset].x += ...;   // line 247


you are accessing (storing) at the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which the profiler is pointing out. (It does not matter that in the very next line of code, you are storing to the .y locations.)


There are at least 2 solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means, and how to do it, so I will let you look that up.
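As a rough sketch of that first option (the struct and kernel names here are hypothetical, not code from the question), an SoA layout gives each component its own contiguous array, so the threads of a warp read adjacent floats:

struct ParticlesSoA      // hypothetical container: one array per component
{
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    float *acc_x, *acc_y, *acc_z;
};

__global__ void update_particles_soa(ParticlesSoA p, float dt, int numParticles)
{
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < numParticles;
         i += blockDim.x * gridDim.x)
    {
        // adjacent threads now touch adjacent 4-byte floats: fully coalesced
        p.vel_x[i] += dt * p.acc_x[i];
        p.vel_y[i] += dt * p.acc_y[i];
        p.vel_z[i] += dt * p.acc_z[i];

        p.pos_x[i] += dt * p.vel_x[i];
        p.pos_y[i] += dt * p.vel_y[i];
        p.pos_z[i] += dt * p.vel_z[i];
    }
}

The trade-off is host-side bookkeeping: nine separate float arrays (or one slab carved into nine views) instead of three float4 arrays.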


The other typical solution is to load a float4 quantity per thread when you need it, and store a float4 quantity per thread when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:

// preceding code need not change
while(index + offset < numParticles)
{
    // load vel and acc once per thread as full 16-byte (float4) quantities
    float4 my_vel = vel[index + offset];
    float4 my_acc = acc[index + offset];
    my_vel.x += dt*my_acc.x;
    my_vel.y += dt*my_acc.y;
    my_vel.z += dt*my_acc.z;
    vel[index + offset] = my_vel;   // store the full float4 back in one piece

    float4 my_pos = pos[index + offset];
    my_pos.x += dt*my_vel.x;
    my_pos.y += dt*my_vel.y;
    my_pos.z += dt*my_vel.z;
    pos[index + offset] = my_pos;

    offset += blockDim.x * gridDim.x;
}


Even though you might think that this code is "less efficient" than your code, because your code "appears" to be only loading and storing .x, .y, .z, whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads and stores to/from global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements, the GPU will load the .w elements from global memory, and also (eventually) store the .w elements back to global memory.


What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247


For line 247 in your original code, you are accessing one float quantity per thread for the load of acc.x, and one float quantity per thread for the load of vel.x. A float quantity per thread by itself requires 128 bytes for a warp, which is four 32-byte L2 cache lines. The two loads together would therefore require eight L2 cache-line loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).
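Spelling the arithmetic out (these figures just restate the reasoning above, using the 32-byte L2 segment size from the documentation quoted in the question):

ideal (SoA) for line 247:
    load vel.x : 32 threads * 4 bytes = 128 bytes -> 4 x 32-byte L2 transactions
    load acc.x : 32 threads * 4 bytes = 128 bytes -> 4 x 32-byte L2 transactions
                                                     = 8 transactions total

actual (AoS), per load:
    each thread's .x is 16 bytes from its neighbour's, so one warp-wide load
    spans 32 * 16 = 512 bytes                     -> 16 x 32-byte L2 transactions

The gap between 16 transactions per load and the ideal 4 is the inefficiency the profiler flags.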
