优化用于三角矩阵计算的CUDA内核的执行 [英] Optimizing execution of a CUDA kernel for Triangular Matrix calculation

查看:130
本文介绍了优化用于三角矩阵计算的CUDA内核的执行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在开发我的第一个Cuda应用程序,并且我的内核具有低于预期的吞吐量,这似乎是当前最大的瓶颈。

I am developing my first Cuda application, and I have a kernel with "below-expected throughput", which seems to be the biggest bottleneck at the moment.

内核的任务是计算N×N大小的矩阵( DD ),其中包含数据矩阵上所有元素之间的平方距离。数据矩阵( Y )的大小为N乘D(以支持多维数据),并存储为以行为主。

The task of the kernel is to compute an N by N sized matrix (DD) containing squared distances between all elements on a data matrix. The data matrix (Y) is size N by D (to support multi dimensional data) and stored as row-major.

来源:

__global__ void computeSquaredEuclideanDistance(const float * __restrict__ Y, float * __restrict__ DD, const int N, const int D) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < N * N; i += stride) {
        const int m = i / N;
        const int n = i % N;
        float tmp = 0;
        for (int d = 0; d < D; ++d) {
            const float Ynd = Y[d + D * n];
            const float Ymd = Y[d + D * m];
            const float Ydiff = Ynd - Ymd;
            tmp += Ydiff * Ydiff;
        }
        DD[n + N * m] = tmp;
    }
}

这是通过来调用的size_t blockSize = 256 size_t numBlocks =(N * N + blockSize-1)/ blockSize

如何优化此内核?我最初的想法是,耗时的部分是在不利用某种共享内存的情况下读取数据,但是任何人都可以给我有关如何实现此目标的指示吗?

How can I optimize this kernel? My initial thought is that the time-consuming part is reading data without exploiting some sort of shared memory, but can anyone give me pointers on how to approach this?

nvvc 分析工具:


  • 延迟分析


    • 计算利用率约为40%

    • 内存(二级缓存)利用率约为35%

    • Latency analysis:
      • Compute utilization at around 40%
      • Memory (L2 cache) utilization at around 35%

      • 活动翘曲在理论值64的57.59处

      • 在理论值100的90%处

      对于我的应用程序,典型值为:

      For my application, typical values are:


      • 5k< N < 30k

      • D 是2或3

      • 5k < N < 30k
      • D is either 2 or 3

      推荐答案

      我通常不考虑这些类型的优化问题,因为在我看来,它们正处于偏离主题的边缘。更糟糕的是,您没有提供 MCVE ,因此任何想回答的人都必须编写自己的所有支持代码才能编译和基准化您的内核。这类工作确实需要基准测试和代码分析。但是因为您的问题基本上是线性代数问题(并且我喜欢线性代数),所以我回答了这个问题而不是将其投票的范围太广了……

      I typically disregard these types of optimization questions because they are on the verge of off-topic, in my opinion. Worst still, you provide no MCVE so anyone trying to answer would have to write all their own support code to compile and benchmark your kernel. And this sort of work does require benchmarking and code analysis. But because your problem is basically a linear algebra problem (and I like linear algebra), I answered it rather than close voting it as too broad......

      从我的胸部。有几件事会立即在代码中跳出来,可以加以改进,并且可能会对运行时间产生重大影响。

      With that off my chest. there are a couple of things which immediately jump out in the code which could be improved and which would probably have a material affect on the run time.

      首先是先验地知道了内部循环的行程数。每当遇到这种情况时,请告知编译器。循环展开和代码重新排序是非常强大的编译器优化,而NVIDIA编译器则非常擅长。如果将D移到模板参数中,则可以执行以下操作:

      The first is that the trip count of the inner loop is known a priori. Anytime you have a situation like that, let the compiler know. Loop unrolling and code reordering is a very powerful compiler optimization and the NVIDIA compiler is extremely good at it. If you move D into a template parameter, you can do something like this:

      template<int D>
      __device__ float esum(const float *x, const float *y)
      {
          float val = 0.f;
      #pragma unroll
          for(int i=0; i<D; i++) { 
              float diff = x[i] - y[i];
              val += diff * diff;
          }
          return val;
      }
      
      template<int D>
      __global__ 
      void vdistance0(const float * __restrict__ Y, float * __restrict__ DD, const int N)
      {
          int index = blockIdx.x * blockDim.x + threadIdx.x;
          int stride = blockDim.x * gridDim.x;
          for (int i = index; i < N * N; i += stride) {
              const int m = i / N;
              const int n = i % N;
              DD[n + N * m] = esum<D>(Y + D * n, Y + D * m);
          }
      }
      
      template __global__ void vdistance0<2>(const float *, float *, const int);
      template __global__ void vdistance0<3>(const float *, float *, const int);
      

      编译器将内联 esum 并展开内部循环,然后可以使用其重新排序试探法更好地交错负载和触发器,以提高吞吐量。结果代码的寄存器占用空间也较小。当我在N = 10000和D = 2上运行此程序时,我的速度提高了35%(7.1ms,而使用CUDA 9.1的GTX 970为4.5ms)。

      The compiler will inline esum and unroll the inner loop and it can then use its reordering heuristics to better interleave loads and flops to improve throughput. The resulting code has a lower register footprint too. When I run this for N=10000 and D=2, I get about 35% speed up (7.1ms versus 4.5ms on a GTX 970 with CUDA 9.1).

      但是,还有比这更明显的优化。您正在执行的计算将产生一个对称的输出矩阵。您只需要执行(N * N)/ 2 个运算即可计算完整矩阵,而不是 N * N 您正在执行您的代码[技术上 N(N / 2 -1),因为对角线项为零,但出于讨论的目的,请忽略对角线]。

      But there is an even more glaringly obvious optimization than this. The calculation you are performing will produce a symmetric output matrix. You only need to do (N*N)/2 operations to compute the full matrix, rather than the N*N you are doing in your code [technically N(N/2 -1) because the diagonal entries are zero, but lets forget the diagonal for the purposes of this discussion].

      因此,采用另一种方法并使用一个块来计算上三角输出矩阵的每一行,则可以执行以下操作:

      So taking a different approach and using one block to calculate each row of the upper triangular output matrix, then you can do something like this:

      struct udiag
      {
          float *p;
          int m;
      
          __device__ __host__ udiag(float *_p, int _m) : p(_p), m(_m) {};
          __device__ __host__ float* get_row(int i) { return p + (i * (i + 1)) / 2; };
      };
      
      
      template<int D>
      __global__ 
      void vdistance2(const float * __restrict__ Y, float * __restrict__ DD, const int N)
      {
           int rowid = blockIdx.x;
           int colid = threadIdx.x;
           udiag m(DD, N);
      
           for(; rowid < N; rowid += gridDim.x) {
               float* p = m.get_row(rowid);
               const float* y = Y + D * rowid;
               for(int i=colid; i < (N-rowid); i += blockDim.x) {
                   p[i] = esum<D>(y, y + D * i);
               }
          }
      }
      template __global__ void vdistance2<2>(const float *, float *, const int);
      template __global__ void vdistance2<3>(const float *, float *, const int);
      

      此函数使用一个小辅助类来封装上三角输出的寻址方案所需的三角数矩阵。这样做可以节省大量的内存和内存带宽,并减少计算的总FLOP数量。如果以后需要做其他事情,BLAS(和CUBLAS)将支持上或下三角矩阵的计算。使用它们。当我运行此程序时,我得到了大约75%的加速(同一GTX 970上为7.1ms对1.6ms)。

      This uses a little helper class to encapsulate the triangle numbers needed for the addressing scheme for the upper triangular output matrix. Doing this saves an enormous amount of memory and memory bandwidth as well as reducing the total FLOP count for the calculation. If you need to do other things afterwards BLAS (and CUBLAS) supports computations on upper or lower triangular matrices. Use them. When I run this I get about 75% speedup (7.1ms versus 1.6ms on the same GTX 970).

      巨大的免责声明:您在此处看到的所有代码都是在45分钟的午餐休息时间编写的,并且非常轻轻测试。我绝对不声称此答案中的任何内容实际上都是正确的。我已经确认它可以编译并在运行以获取分析数据时不会产生运行时错误。这就对了。 Cavaet Emptor 等等。

      Huge disclaimer: All the code you see here was written during a 45 minute lunch break and as been very lightly tested. I make absolutely no claims that anything in this answer is actually correct. I have confirmed that it compiles and doesn't produce a runtime error when I run it to get profiling data. That is it. Cavaet Emptor and all that.

      这篇关于优化用于三角矩阵计算的CUDA内核的执行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆