OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)
Question
I'm writing a matrix-multiplication program with OpenMP that, for better cache efficiency, multiplies A by B-transpose row-by-row instead of the classic A x B rows-by-columns. Doing this I ran into a fact that seems illogical to me: if I parallelize the outer loop in this code, the program is slower than when I put the OpenMP directive on the innermost loop; on my machine the times are 10.9 vs. 8.1 seconds.
// A and B are double* allocated with malloc; Nu is the side length of the
// matrices, which are square
//#pragma omp parallel for
for (i = 0; i < Nu; i++) {
    for (j = 0; j < Nu; j++) {
        *(C + (i*Nu + j)) = 0.;
        #pragma omp parallel for
        for (k = 0; k < Nu; k++) {
            *(C + (i*Nu + j)) += *(A + (i*Nu + k)) * *(B + (j*Nu + k)); // C(i,j) = sum(over k) A(i,k)*B(k,j)
        }
    }
}
Answer
Try hitting the result less often. Writing to C on every iteration induces cache-line sharing between cores and prevents the operation from running in parallel. Accumulating into a local variable instead will allow most of the writes to take place in each core's L1 cache.
Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.
Try:
for (i = 0; i < Nu; i++) {
    const double* const Arow = A + i*Nu;
    double* const Crow = C + i*Nu;
    #pragma omp parallel for
    for (j = 0; j < Nu; j++) {
        const double* const Bcol = B + j*Nu;
        double sum = 0.0;
        for (k = 0; k < Nu; k++) {
            sum += Arow[k] * Bcol[k]; // C(i,j) = sum(over k) A(i,k)*B(k,j)
        }
        Crow[j] = sum;
    }
}
Also, I think Elalfer is right about needing a reduction if you parallelize the innermost loop.