Global memory access and L1 cache in Kepler


Problem Description


While profiling my kernels in Visual Profiler on Kepler hardware, I've noticed that the profiler shows global loads and stores as cached in L1. I'm confused, because the programming guide and the Kepler tuning guide state:

L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache).

There are no register spills (the profiler shows L1 caching even for a primitive, two-line 'add' kernel), and I'm not sure what 'stack data' means here.
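(For reference, the absence of spills can be confirmed from the ptxas resource-usage statistics at compile time; the file name `addKernel.cu` below is assumed, not from the original post:)

```shell
# Ask ptxas to print per-kernel register and spill statistics.
nvcc -arch=sm_35 -Xptxas -v -c addKernel.cu -o addKernel.o
# ptxas then reports lines along the lines of:
#   ptxas info : Used N registers, ... 0 bytes spill stores, 0 bytes spill loads
```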

The GK110 whitepaper shows that global accesses go through the L1 cache in all but one case: loads through the read-only cache (__ldg). Does this mean that although global accesses pass through the L1 hardware, they are not actually cached? Does it also mean that if I have spilled register data cached in L1, that data can be evicted as a result of a gmem access?

UPDATE: I've realized that I might be misreading the information the profiler is giving me, so here is the kernel code along with the profiler results (I've tried it on both a Titan and a K40, with the same results).

template<typename T>
__global__ void addKernel(T *c, const T *a, const T *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

...
// Kernel call
float* x;
float* y;
float* d;
// ...
addKernel<<<1024, 1024>>>(d, x, y);
cudaError_t cudaStatus = cudaDeviceSynchronize();
assert(cudaSuccess == cudaStatus);

Visual Profiler output: [profiler screenshot not reproduced]

The L1 numbers make perfect sense given that the L1 cache is enabled for gmem accesses. For the loads we have:

65536 * 128 == 2 * 4 * 1024 * 1024

UPDATE 2: added SASS and PTX code. The SASS code is very simple and contains reads from constant memory and loads/stores from/to global memory (LD/ST instructions).

Function : _Z9addKernelIfEvPT_PKS0_S3_
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                             /* 0x088cb0a0a08c1000 */
/*0008*/                MOV R1, c[0x0][0x44];                /* 0x64c03c00089c0006 */
/*0010*/                S2R R0, SR_CTAID.X;                  /* 0x86400000129c0002 */
/*0018*/                MOV32I R5, 0x4;                      /* 0x74000000021fc016 */
/*0020*/                S2R R3, SR_TID.X;                    /* 0x86400000109c000e */
/*0028*/                IMAD R2, R0, c[0x0][0x28], R3;       /* 0x51080c00051c000a */
/*0030*/                IMAD R6.CC, R2, R5, c[0x0][0x148];   /* 0x910c1400291c081a */
/*0038*/                IMAD.HI.X R7, R2, R5, c[0x0][0x14c]; /* 0x93181400299c081e */
                                                             /* 0x08a0a4b0809c80b0 */
/*0048*/                IMAD R8.CC, R2, R5, c[0x0][0x150];   /* 0x910c14002a1c0822 */
/*0050*/                IMAD.HI.X R9, R2, R5, c[0x0][0x154]; /* 0x931814002a9c0826 */
/*0058*/                LD.E R3, [R6];                       /* 0xc4800000001c180c */
/*0060*/                LD.E R0, [R8];                       /* 0xc4800000001c2000 */
/*0068*/                IMAD R4.CC, R2, R5, c[0x0][0x140];   /* 0x910c1400281c0812 */
/*0070*/                IMAD.HI.X R5, R2, R5, c[0x0][0x144]; /* 0x93181400289c0816 */
/*0078*/                FADD R0, R3, R0;                     /* 0xe2c00000001c0c02 */
                                                             /* 0x080000000000b810 */
/*0088*/                ST.E [R4], R0;                       /* 0xe4800000001c1000 */
/*0090*/                EXIT ;                               /* 0x18000000001c003c */
/*0098*/                BRA 0x98;                            /* 0x12007ffffc1c003c */
/*00a0*/                NOP;                                 /* 0x85800000001c3c02 */
/*00a8*/                NOP;                                 /* 0x85800000001c3c02 */
/*00b0*/                NOP;                                 /* 0x85800000001c3c02 */
/*00b8*/                NOP;                                 /* 0x85800000001c3c02 */

PTX:

.visible .entry _Z9addKernelIfEvPT_PKS0_S3_(
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_0,
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_1,
.param .u64 _Z9addKernelIfEvPT_PKS0_S3__param_2
)
{
.reg .s32 %r<5>;
.reg .f32 %f<4>;
.reg .s64 %rd<11>;

ld.param.u64 %rd1, [_Z9addKernelIfEvPT_PKS0_S3__param_0];
ld.param.u64 %rd2, [_Z9addKernelIfEvPT_PKS0_S3__param_1];
ld.param.u64 %rd3, [_Z9addKernelIfEvPT_PKS0_S3__param_2];
cvta.to.global.u64 %rd4, %rd1;
.loc 1 22 1
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r1, %r2, %r3;
cvta.to.global.u64 %rd5, %rd2;
mul.wide.s32 %rd6, %r4, 4;
add.s64 %rd7, %rd5, %rd6;
cvta.to.global.u64 %rd8, %rd3;
add.s64 %rd9, %rd8, %rd6;
.loc 1 23 1
ld.global.f32 %f1, [%rd9];
ld.global.f32 %f2, [%rd7];
add.f32 %f3, %f2, %f1;
add.s64 %rd10, %rd4, %rd6;
.loc 1 23 1
st.global.f32 [%rd10], %f3;
.loc 1 24 2
ret;
}

Solution

On the Fermi and Kepler architectures, all generic, global, local, and shared memory operations are handled by the L1 cache. Shared memory accesses do not require a tag lookup and do not invalidate a cache line. All local and global memory accesses require a tag lookup. Uncached global memory stores and reads will invalidate a cache line. On compute capability 3.0 and 3.5, all global memory reads, with the exception of LDG on CC 3.5, are uncached. The LDG instruction goes through the texture cache.
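As a follow-up illustration (a sketch, not part of the original answer): on CC 3.5 a global load can be routed through the read-only data cache explicitly with __ldg(), or often implicitly by qualifying the input pointers with const and __restrict__, in which case the compiler can emit LDG on its own. A minimal variant of the question's addKernel, assuming T is a type __ldg is overloaded for (e.g. float):

```cuda
// Variant of the question's addKernel whose global loads go through the
// read-only (texture) cache path on compute capability 3.5+.
template <typename T>
__global__ void addKernelLdg(T *c, const T *__restrict__ a,
                             const T *__restrict__ b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // __ldg issues an LDG instruction instead of a plain LD.
    c[i] = __ldg(&a[i]) + __ldg(&b[i]);
}
```

With the const __restrict__ qualifiers alone, the compiler is frequently able to prove the loads are read-only and emit LDG without the intrinsic; __ldg simply makes the intent explicit.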
