Is memory operation for L2 cache significantly faster than global memory for NVIDIA GPU?


Question


Modern GPU architectures have both an L1 cache and an L2 cache. It is well known that the L1 cache is much faster than global memory. However, the speed of the L2 cache is less clear from the CUDA documentation. I looked through the CUDA documentation, but could only find that a global memory operation has a latency of about 300-500 cycles, while an L1 cache operation takes only about 30 cycles. Can anyone give the speed of the L2 cache? Such information could be very useful, since programmers will not focus on optimizing L2 cache usage if it is not much faster than global memory. If the speed differs across architectures, I just want to focus on the latest ones, such as the NVIDIA GeForce RTX 3090 (Compute Capability 8.6) or the NVIDIA Tesla V100 (Compute Capability 7.0).

Thanks!

Answer


There are at least two figures of merit commonly used when discussing GPU memory: latency and bandwidth. From a latency perspective, this number is not published by NVIDIA (that I know of), and the usual practice is to discover it with careful microbenchmarking.
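For illustration (this sketch is my addition, not part of the original answer), the usual shape of such a latency microbenchmark is a single-thread pointer chase: every load depends on the previous one, so the measured time reflects latency rather than bandwidth. The names, the 4MB footprint (chosen to be larger than L1 but within a V100's 6MB L2), and the use of clock64() are all illustrative choices; treat the result as a rough estimate only.

#include <cstdio>

// Rough latency probe: one thread follows a dependent chain of loads,
// so each load must wait for the previous one to complete.
__global__ void chase(const int * __restrict__ next, int start, int steps,
                      long long *cycles, int *sink){
  int idx = start;
  long long t0 = clock64();
  for (int i = 0; i < steps; i++)
    idx = next[idx];                 // serially dependent loads
  long long t1 = clock64();
  *cycles = (t1 - t0) / steps;       // average cycles per load
  *sink = idx;                       // keep the chain from being optimized away
}

int main(){
  const int n = 1048576;             // 4MB of int: larger than L1, fits in V100's 6MB L2
  int *h = new int[n];
  for (int i = 0; i < n; i++) h[i] = (i + 32) % n;  // 128-byte-stride chain
  int *d, *sink;
  long long *cyc;
  cudaMalloc(&d, n*sizeof(int));
  cudaMalloc(&sink, sizeof(int));
  cudaMalloc(&cyc, sizeof(long long));
  cudaMemcpy(d, h, n*sizeof(int), cudaMemcpyHostToDevice);
  chase<<<1,1>>>(d, 0, 100000, cyc, sink);  // warm-up pass populates L2
  chase<<<1,1>>>(d, 0, 100000, cyc, sink);  // measured pass
  long long result;
  cudaMemcpy(&result, cyc, sizeof(result), cudaMemcpyDeviceToHost);
  printf("~%lld cycles per dependent load\n", result);
  delete [] h;
}

Varying the footprint above and below the L1 and L2 sizes is what lets such a probe separate the three levels of the hierarchy.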


From a bandwidth perspective, AFAIK this number is also not published by NVIDIA (for the L2 cache), but it should be fairly easy to discover with a simple test case using a copy kernel. We can estimate the bandwidth of global memory by ensuring that the copy kernel uses a footprint much larger than the published L2 cache size (6MB for V100), and we can estimate the bandwidth of L2 by keeping the footprint smaller than that.


Such a code (IMO) is fairly trivial to write:

$ cat t44.cu
template <typename T>
__global__ void k(volatile T * __restrict__ d1, volatile T * __restrict__ d2, const int loops, const int ds){
  // grid-stride copy; the direction alternates each pass so the
  // memory traffic cannot be optimized away
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if (i&1) d1[j] = d2[j];
      else d2[j] = d1[j];
}
const int dsize = 1048576*128;  // 128MB allocation (in bytes)
const int iter = 64;
int main(){

  int *d;
  cudaMalloc(&d, dsize);
  // case 1: 32MB copy, should exceed L2 cache on V100
  int csize = 1048576*8;        // 8M int elements = 32MB per buffer
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  // case 2: 2MB copy, should fit in L2 cache on V100
  csize = 1048576/2;            // 512K int elements = 2MB per buffer
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  cudaDeviceSynchronize();
}

$ nvcc -o t44 t44.cu
$ nvprof ./t44
==53310== NVPROF is profiling process 53310, command: ./t44
==53310== Profiling application: ./t44
==53310== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  6.9032ms         2  3.4516ms  123.39us  6.7798ms  void k<int>(int volatile *, int volatile *, int, int)
      API calls:   89.47%  263.86ms         1  263.86ms  263.86ms  263.86ms  cudaMalloc
                    4.45%  13.111ms         8  1.6388ms  942.75us  2.2322ms  cuDeviceTotalMem
                    3.37%  9.9523ms       808  12.317us     186ns  725.86us  cuDeviceGetAttribute
                    2.34%  6.9006ms         1  6.9006ms  6.9006ms  6.9006ms  cudaDeviceSynchronize
                    0.33%  985.49us         8  123.19us  85.864us  180.73us  cuDeviceGetName
                    0.01%  42.668us         8  5.3330us  1.8710us  22.553us  cuDeviceGetPCIBusId
                    0.01%  34.281us         2  17.140us  6.2880us  27.993us  cudaLaunchKernel
                    0.00%  8.0290us        16     501ns     256ns  1.7980us  cuDeviceGet
                    0.00%  3.4000us         8     425ns     217ns     876ns  cuDeviceGetUuid
                    0.00%  3.3970us         3  1.1320us     652ns  2.0020us  cuDeviceGetCount
$


Based on the profiler output, where the 6.7798ms Max kernel duration corresponds to the 32MB case, we can estimate global memory bandwidth as:

2*64*32MB/6.78ms = 604GB/s

(the factor of 2 accounts for each element being read once and written once on each of the 64 loop passes)


and, taking the 123.39us Min kernel duration as the 2MB case, we can estimate L2 bandwidth as:

2*64*2MB/123us   = 2.08TB/s


Both of these are rough measurements (I'm not doing careful benchmarking here), but bandwidthTest on this V100 GPU reports a device memory bandwidth of ~700GB/s, so I believe the 600GB/s number is in the ballpark. Taking that as a sanity check on the methodology, we might guess that the L2 cache can be ~3-4x faster than global memory in some circumstances.
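As a side note (my addition, not part of the original answer): since the question asks about multiple architectures, it is worth knowing that the 6MB L2 threshold hardcoded above is device-specific, and the CUDA runtime can report it directly. A minimal query like the following lets the copy footprints be sized per-GPU:

#include <cstdio>

int main(){
  // query the current device's L2 cache size so the "fits in L2" and
  // "exceeds L2" footprints can be chosen per-GPU rather than hardcoded
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("%s: L2 cache size = %d bytes\n", prop.name, prop.l2CacheSize);
}

On a V100 this should report the same 6MB figure used in the footprint reasoning above.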

