OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)

Problem description

I'm writing a matrix multiplication program with OpenMP. For better cache efficiency it computes A x B(transpose), rows x rows, instead of the classic A x B, rows x columns. Doing this, I ran into a result that seems illogical to me: if I parallelize the outermost loop in this code, the program is slower than if I put the OpenMP directive on the innermost loop. On my computer the times are 10.9 vs 8.1 seconds.

//A, B and C are double* allocated with malloc; Nu is the side length of the
//matrices, which are square

//#pragma omp parallel for
for (i=0; i<Nu; i++){
  for (j=0; j<Nu; j++){
    *(C+(i*Nu+j)) = 0.;
#pragma omp parallel for
    for(k=0;k<Nu ;k++){
      *(C+(i*Nu+j))+=*(A+(i*Nu+k)) * *(B+(j*Nu+k));//C(i,j)=sum(over k) A(i,k)*B(k,j); B is stored transposed, so B(k,j) is read at B[j*Nu+k]
    }
  }
}

Recommended answer

Try writing to the result less often. Updating C(i,j) on every iteration of the k loop induces cache-line sharing between cores and prevents the operation from running in parallel. Accumulating into a local variable instead allows most of the writes to take place in each core's L1 cache.

Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B, so it has to reload their elements after every store.

Try:

for (i=0; i<Nu; i++){
  const double* const Arow = A + i*Nu;
  double* const Crow = C + i*Nu;
// k is declared outside the loop, so it must be made private to avoid a data race
#pragma omp parallel for private(k)
  for (j=0; j<Nu; j++){
    const double* const Bcol = B + j*Nu;
    double sum = 0.0;
    for(k=0;k<Nu ;k++){
      sum += Arow[k] * Bcol[k]; //C(i,j)=sum(over k) A(i,k)*B(k,j)
    }
    Crow[j] = sum;
  }
}
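
The restrict suggestion and the local accumulator can be combined in one function. What follows is only a minimal sketch: the name matmul_bt, its signature, and the choice to put the directive on the outer loop (so the thread team is created once rather than once per row) are assumptions for illustration, not part of the original answer. It needs C99 or later for restrict, and it compiles without OpenMP as well, since the pragma is then ignored.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative, not from the
 * original post). Computes C = A * B for n x n row-major matrices, where
 * Bt holds B already transposed, so both operands are read row by row.
 * restrict promises the compiler that A, Bt and C do not overlap. */
void matmul_bt(const double *restrict A, const double *restrict Bt,
               double *restrict C, int n)
{
    /* i, j and k are declared inside the loops, so every thread gets
     * its own private copies without any extra clauses. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        const double *Arow = A + (size_t)i * n;
        double *Crow = C + (size_t)i * n;
        for (int j = 0; j < n; j++) {
            const double *Brow = Bt + (size_t)j * n;
            double sum = 0.0;          /* local accumulator */
            for (int k = 0; k < n; k++)
                sum += Arow[k] * Brow[k];
            Crow[j] = sum;             /* one store per element of C */
        }
    }
}
```

Whether the outer or the middle loop is the better place for the directive depends on the machine and on Nu, so it is worth timing both.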

Also, I think Elalfer is right that you need a reduction if you parallelize the innermost loop. Without one, all threads update the same C(i,j) at once, which is a data race.
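
For reference, a reduction on the innermost loop could look roughly like the sketch below. The helper name dot_reduce is hypothetical; it stands in for the k loop of the question, with a and b being one row of A and one row of the transposed B.

```c
/* Hypothetical helper: dot product of two length-n rows, with the loop
 * parallelized safely through an OpenMP reduction. */
double dot_reduce(const double *a, const double *b, int n)
{
    double sum = 0.0;
    /* Each thread accumulates into a private copy of sum; the copies
     * are added together when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}
```

This removes the race, but it still starts a parallel region for every element of C, so the overhead per call is likely to outweigh the gain unless the rows are very long.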
