Faster way to do multi dimensional matrix addition?


Question



I have a matrix A of size (m * l * 4); m is around 100,000 and l = 100. The size of list is always equal to n, with n <= m. I want to do matrix addition over a given list of indexes. I have written this function, and it has to be called many times.

void MatrixAddition(int l, int n, vector<int>& list, int ***A, int ***C, int cluster)
{
    // zero the output rows
    for (int i = 0; i < l; i++)
    {
        for (int j = 0; j < 4; j++)
            C[cluster][i][j] = 0;
    }

    // accumulate the rows of A selected by list
    for (int i = 0; i < l; i++)
    {
        for (int j = 0; j < n; ++j)
        {
            for (int k = 0; k < 4; k++)
                C[cluster][i][k] += A[list[j]][i][k];
        }
    }
}

I used gprof to measure how much time each function takes in the whole program, and I found that 60% of the time is taken by the MatrixAddition function. Is there an alternative way to write this function so that my run time is reduced?

  %   cumulative   self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
52.00      7.85       7.85       20   392.60   405.49  MatrixAddition(int, int, std::vector<int, std::allocator<int> >&, int***, int***, int)

Solution

(update: an earlier version had the indexing wrong. This version auto-vectorizes fairly easily).

Use a C multidimensional array (rather than an array of pointers to pointers), or a flat 1D array indexed with i*cols + j, so the memory is contiguous. This will make a huge difference to the effectiveness of hardware prefetching to make good use of memory bandwidth. Loads with an address coming from another load really suck for performance, or conversely, having predictable addresses known well ahead of time helps a lot because the loads can start well before they're needed (thanks to out-of-order execution).
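For illustration, here is what the contiguous indexing looks like (a minimal sketch; elem3d is a hypothetical helper, not part of the code below):

// Hypothetical helper: index an m x l x 4 matrix stored as one contiguous block.
// The element address is pure arithmetic on the base pointer, so hardware
// prefetch can stream through it; with int***, every access chases two
// pointers whose targets may be scattered across the heap.
inline int& elem3d(int *base, long l, long i, long j, long k) {
    return base[(i * l + j) * 4 + k];   // row-major offset
}
// allocate:  int *A = new int[(long)m * l * 4];
// access:    elem3d(A, l, i, j, k) += 1;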

Also, @user31264's answer is correct, you need to interchange the loops so the loop over j is the outer-most. This is good but nowhere near sufficient on its own.
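As a minimal sketch of just the interchange, keeping the original int*** layout (so the layout problem above remains):

// Loop interchange only: the loop over the n list entries becomes outermost,
// so A[list[j]] is chased once per selected cluster instead of once per (i, j) pair.
void MatrixAddition_interchanged(int l, int n, const vector<int>& list,
                                 int ***A, int ***C, int cluster)
{
    for (int i = 0; i < l; i++)
        for (int k = 0; k < 4; k++)
            C[cluster][i][k] = 0;

    for (int j = 0; j < n; ++j) {
        int **Aj = A[list[j]];          // hoist the dependent load
        for (int i = 0; i < l; i++)
            for (int k = 0; k < 4; k++)
                C[cluster][i][k] += Aj[i][k];
    }
}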

This will also allow the compiler to auto-vectorize. Actually, I had a surprisingly hard time getting gcc to auto-vectorize well. (But that's probably because I got the indexing wrong, because I only looked at the code the first time. So the compiler didn't know we were looping over contiguous memory.)


I played around with it on the Godbolt compiler explorer.

I finally got good nice compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:

void MatrixAddition_contiguous(int rows, int n, const vector<int>& list,
                               const int *__restrict__ A, int *__restrict__ C, int cluster)
  // still auto-vectorizes without __restrict__, but checks for overlap and
  // runs a scalar loop in that case
{
  const int cols = 4;  // or global constexpr or something
  int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

  for(int i=0;i<rows;i++)
    //#pragma omp simd  
    for(int k=0;k<4;k++)
      Ccluster[cols*i + k]=0;

  for(int j=0;j < n;++j) { // loop over the selected clusters of A in the outer-most loop
    const int *__restrict__ Alistj = A + ((long)list[j]) * rows * cols;
    // #pragma omp simd    // Doesn't work: only auto-vectorizes with -O3
    // probably only -O3 lets gcc see through the k=0..3 loop and treat it like one big loop
    for (int i = 0; i < rows; i++) {
      long row_offset = cols*i;
      //#pragma omp simd  // forces vectorization with 16B vectors, so it hurts AVX2
      for(int k=0;k<4;k++)
        Ccluster[row_offset + k] += Alistj[row_offset + k];
    }
  }
}
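To call this version, A and C have to be single contiguous allocations instead of pointer trees, along these lines (a hypothetical setup; num_clusters stands for whatever C's first dimension is):

// all m clusters of A in one block, all output clusters of C in another
int *A = new int[(long)m * rows * 4];
int *C = new int[(long)num_clusters * rows * 4];
MatrixAddition_contiguous(rows, n, list, A, C, cluster);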

Manually hoisting the list[j] definitely helped the compiler realize that stores into C can't affect the indices that will be loaded from list[j]. Manually hoisting the other stuff probably wasn't necessary.

Hoisting A[list[j]], rather than just list[j], is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from list[j] as far as possible, the compiler can do a good job even if it doesn't know that list doesn't overlap C.

The inner loop, compiled with gcc 5.3 targeting x86-64 with -O3 -Wall -march=haswell -fopenmp (and -fverbose-asm), is:

.L26:
    vmovdqu ymm0, YMMWORD PTR [r8+rax]        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B]
    vpaddd  ymm0, ymm0, YMMWORD PTR [rdx+rax]   # vect__71.75, MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: vectp.73_90, index: ivtmp.91_26, offset: 0B]
    add     r12d, 1   # ivtmp.88,
    vmovdqu YMMWORD PTR [r8+rax], ymm0        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], vect__71.75
    add     rax, 32   # ivtmp.91,
    cmp     r12d, r9d # ivtmp.88, bnd.66
    jb      .L26        #,

So it's doing eight adds at once, with AVX2 vpaddd, with unaligned loads and an unaligned store back into C.
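For comparison, here is the same step written by hand with AVX2 intrinsics (a sketch of what the auto-vectorized loop is doing, not gcc's output; the function name is made up, and it assumes the per-cluster element count is a multiple of 8):

#include <immintrin.h>

// Hand-written equivalent of the auto-vectorized inner loop: add one selected
// cluster of A into Ccluster, 8 ints per vpaddd.  Assumes total % 8 == 0.
static void add_cluster_avx2(int *Ccluster, const int *Alistj, long total)
{
    for (long x = 0; x < total; x += 8) {
        __m256i c = _mm256_loadu_si256((const __m256i*)(Ccluster + x));
        __m256i a = _mm256_loadu_si256((const __m256i*)(Alistj + x));
        _mm256_storeu_si256((__m256i*)(Ccluster + x), _mm256_add_epi32(c, a));
    }
}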

Since this is auto-vectorizing, it should make good code with ARM NEON, or PPC Altivec, or anything that can do packed 32-bit addition.

I couldn't get gcc to tell me anything with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was slightly helpful.
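For reference, the invocations look something like this (the file name is made up):

g++     -O3 -march=haswell -ftree-vectorizer-verbose=2     -c matrixadd.cpp
clang++ -O3 -march=haswell -Rpass-analysis=loop-vectorize  -c matrixadd.cpp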
