Global memory access and L1 cache in Kepler

Question

While profiling my kernels in Visual Profiler on Kepler hardware, I’ve noticed the profiler shows that global loads and stores are cached in L1. I'm confused because the programming guide and Kepler tuning manual state that:

L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache).

There are no register spills (the profiler shows L1 caching even for a primitive, two-line 'add' kernel), and I'm not sure what 'stack data' means here.

The GK110 whitepaper shows that global accesses go through the L1 cache in all but one case: loads through the read-only cache (__ldg). Does this mean that while global accesses pass through the L1 hardware they are not actually cached? Does it also mean that if I have spilled register data cached in L1, that data can be evicted as a result of gmem accesses?
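For illustration, here is a minimal sketch (my own, not part of the original kernel) of what the explicit __ldg variant being asked about would look like; on CC 3.5 the two loads should show up as LDG rather than LD.E in the SASS, assuming the data behind a and b is read-only for the duration of the launch:

// Sketch only: explicit __ldg() loads go through the read-only data
// (texture) cache on compute capability 3.5+. The memory behind a and b
// must not be written during the kernel launch.
__global__ void addKernelLdg(float *c, const float *a, const float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = __ldg(&a[i]) + __ldg(&b[i]);
}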

UPDATE: I've realized that I might be misreading the information the profiler is giving me, so here are the kernel code and the profiler results (I've tried it on both a Titan and a K40 with the same results).

template<typename T>
__global__ void addKernel(T *c, const T *a, const T *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

...
// Kernel call
float* x;
float* y;
float* d;
// ...
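// Launch: 1024 blocks of 1024 threads = 1,048,576 threads, one element per thread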
addKernel<<<1024, 1024>>>(d, x, y);
cudaError_t cudaStatus = cudaDeviceSynchronize();
assert(cudaSuccess == cudaStatus);

Visual Profiler output:

The L1 numbers make perfect sense given that L1 caching is enabled for gmem accesses. For the loads we have:

65536 * 128 == 2 * 4 * 1024 * 1024
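(Presumably 65536 here is the number of 128-byte load transactions reported by the profiler; assuming fully coalesced accesses, the breakdown would be:

1024 blocks * 1024 threads = 1,048,576 threads = 32,768 warps
32,768 warps * 2 global loads per warp (a[i] and b[i]) = 65,536 transactions
65,536 transactions * 128 bytes = 8,388,608 bytes
8,388,608 bytes = 2 arrays * 4 bytes per float * 1,048,576 elements)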

UPDATE 2: Added the SASS and PTX code. The SASS code is very simple and contains reads from constant memory and loads/stores from/to global memory (LD/ST instructions).

Function : _Z9addKernelIfEvPT_PKS0_S3_
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                             /* 0x088cb0a0a08c1000 */
/*0008*/                MOV R1, c[0x0][0x44];                /* 0x64c03c00089c0006 */
/*0010*/                S2R R0, SR_CTAID.X;                  /* 0x86400000129c0002 */
/*0018*/                MOV32I R5, 0x4;                      /* 0x74000000021fc016 */
/*0020*/                S2R R3, SR_TID.X;                    /* 0x86400000109c000e */
/*0028*/                IMAD R2, R0, c[0x0][0x28], R3;       /* 0x51080c00051c000a */
/*0030*/                IMAD R6.CC, R2, R5, c[0x0][0x148];   /* 0x910c1400291c081a */
/*0038*/                IMAD.HI.X R7, R2, R5, c[0x0][0x14c]; /* 0x93181400299c081e */
                                                             /* 0x08a0a4b0809c80b0 */
/*0048*/                IMAD R8.CC, R2, R5, c[0x0][0x150];   /* 0x910c14002a1c0822 */
/*0050*/                IMAD.HI.X R9, R2, R5, c[0x0][0x154]; /* 0x931814002a9c0826 */
/*0058*/                LD.E R3, [R6];                       /* 0xc4800000001c180c */
/*0060*/                LD.E R0, [R8];                       /* 0xc4800000001c2000 */
/*0068*/                IMAD R4.CC, R2, R5, c[0x0][0x140];   /* 0x910c1400281c0812 */
/*0070*/                IMAD.HI.X R5, R2, R5, c[0x0][0x144]; /* 0x93181400289c0816 */
/*0078*/                FADD R0, R3, R0;                     /* 0xe2c00000001c0c02 */
                                                             /* 0x080000000000b810 */
/*0088*/                ST.E [R4], R0;                       /* 0xe4800000001c1000 */
/*0090*/                EXIT ;                               /* 0x18000000001c003c */
/*0098*/                BRA 0x98;                            /* 0x12007ffffc1c003c */
/*00a0*/                NOP;                                 /* 0x85800000001c3c02 */
/*00a8*/                NOP;                                 /* 0x85800000001c3c02 */
/*00b0*/                NOP;                                 /* 0x85800000001c3c02 */
/*00b8*/                NOP;                                 /* 0x85800000001c3c02 */

PTX:

.visible .entry _Z9addKernelIfEvPT_PKS0_S3_(
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_0,
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_1,
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_2
)
{
.reg .s32 %r<5>;
.reg .f32 %f<4>;
.reg .s64 %rd<11>;

ld.param.u64 %rd1, [_Z9addKernelIfEvPT_PKS0_S3__param_0];
ld.param.u64 %rd2, [_Z9addKernelIfEvPT_PKS0_S3__param_1];
ld.param.u64 %rd3, [_Z9addKernelIfEvPT_PKS0_S3__param_2];
cvta.to.global.u64 %rd4, %rd1;
.loc 1 22 1
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r1, %r2, %r3;
cvta.to.global.u64 %rd5, %rd2;
mul.wide.s32 %rd6, %r4, 4;
add.s64 %rd7, %rd5, %rd6;
cvta.to.global.u64 %rd8, %rd3;
add.s64 %rd9, %rd8, %rd6;
.loc 1 23 1
ld.global.f32 %f1, [%rd9];
ld.global.f32 %f2, [%rd7];
add.f32 %f3, %f2, %f1;
add.s64 %rd10, %rd4, %rd6;
.loc 1 23 1
st.global.f32 [%rd10], %f3;
.loc 1 24 2
ret;
}

Answer

On the Fermi and Kepler architectures all generic, global, local, and shared memory operations are handled by the L1 cache. Shared memory accesses do not require a tag look-up and do not invalidate a cache line. All local and global memory accesses require a tag look-up. Uncached global memory stores and reads will invalidate a cache line. On compute capability 3.0 and 3.5 all global memory reads, with the exception of LDG on CC 3.5, are uncached. The LDG instruction goes through the texture cache.
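As a side note, a hedged sketch (my addition, not part of the original answer) of how the compiler can be allowed to use the LDG path on CC 3.5 without calling __ldg() explicitly: qualify the input pointers as const and __restrict__, then check the SASS (for example with cuobjdump -sass) to see whether LDG is emitted instead of LD.E.

// Sketch: const + __restrict__ tell the compiler the inputs are read-only
// and non-aliased, so on CC 3.5 it may emit LDG (read-only/texture cache
// path) for a[i] and b[i] instead of the generic LD.E seen above.
template<typename T>
__global__ void addKernelRO(T* __restrict__ c,
                            const T* __restrict__ a,
                            const T* __restrict__ b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}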
