OpenMP C ++矩阵乘法并行运行较慢 [英] OpenMP C++ Matrix Multiplication run slower in parallel
问题描述
我正在学习使用OpenMP并行执行for循环的基础知识。
I'm learning the basics of paralel execution of for loop using OpenMP.
很遗憾,我的paralel程序比串行版本慢10倍。
我做错了什么?我缺少一些障碍?
Sadly, my paralel program runs 10x slower than serial version. What am I doing wrong? Am I missing some barriers?
double **basicMultiply(double **A, double **B, int size) {
int i, j, k;
double **res = createMatrix(size);
omp_set_num_threads(4);
#pragma omp parallel for private(k)
for (i = 0; i < size; i++) {
for (j = 0; j < size; j++) {
for (k = 0; k < size; k++) {
res[i][j] += A[i][k] * B[k][j];
}
}
}
return res;
}
非常感谢!
推荐答案
您的问题是由于内部循环变量 j
的竞争条件。它需要被私人化。
Your problem is due to a race condition on the inner loop variable j
. It needs to be made private.
对于C89,我会这样做:
For C89 I would do something like this:
#pragma omp parallel
{
int i, j, k;
#pragma omp for
for(i=0; ...
对于C ++或C99使用混合声明
For C++ or C99 use mixed declarations
#pragma omp parallel for
for(int i=0; ...
这样做不需要显式地声明任何共享或私有。
Doing this you don't have to explicitly declare anything shared or private.
对代码进一步注释当你执行 B [k] [j]
时,你的单线程代码不是缓存友好的缓存行然后移动到下一个缓存行等等,直到点积被完成,其他缓存行已被赶出,而是你应该首先转置并访问 BT [j] [k ]
Some further comments to your code. Your single threaded code is not cache friendly when you do B[k][j]
. This reads a cacheline then moves to the next cache line and so forth until the dot product is done by which time the other cachelines have been evicted. Instead you should take the transpose first and access as BT[j][k]
. Additionally, you have allocated arrays of arrays and not one contiguous 2D array. I fixed your code to use the transpose and a contiguous 2D array.
以下是size = 512的时候。
Here are the times I get for size=512.
no transpose no openmp 0.94s
no transpose, openmp 0.23s
tranpose, no openmp 0.27s
transpose, openmp 0.08s
以下是代码(另见 http://coliru.stacked-crooked.com/a/ee174916fa035f97 ) )
Below is the code (also see http://coliru.stacked-crooked.com/a/ee174916fa035f97)
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
void transpose(double *A, double *B, int n) {
int i,j;
for(i=0; i<n; i++) {
for(j=0; j<n; j++) {
B[j*n+i] = A[i*n+j];
}
}
}
void gemm(double *A, double *B, double *C, int n)
{
int i, j, k;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
double dot = 0;
for (k = 0; k < n; k++) {
dot += A[i*n+k]*B[k*n+j];
}
C[i*n+j ] = dot;
}
}
}
void gemm_omp(double *A, double *B, double *C, int n)
{
#pragma omp parallel
{
int i, j, k;
#pragma omp for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
double dot = 0;
for (k = 0; k < n; k++) {
dot += A[i*n+k]*B[k*n+j];
}
C[i*n+j ] = dot;
}
}
}
}
void gemmT(double *A, double *B, double *C, int n)
{
int i, j, k;
double *B2;
B2 = (double*)malloc(sizeof(double)*n*n);
transpose(B,B2, n);
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
double dot = 0;
for (k = 0; k < n; k++) {
dot += A[i*n+k]*B2[j*n+k];
}
C[i*n+j ] = dot;
}
}
free(B2);
}
void gemmT_omp(double *A, double *B, double *C, int n)
{
double *B2;
B2 = (double*)malloc(sizeof(double)*n*n);
transpose(B,B2, n);
#pragma omp parallel
{
int i, j, k;
#pragma omp for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
double dot = 0;
for (k = 0; k < n; k++) {
dot += A[i*n+k]*B2[j*n+k];
}
C[i*n+j ] = dot;
}
}
}
free(B2);
}
int main() {
int i, n;
double *A, *B, *C, dtime;
n=512;
A = (double*)malloc(sizeof(double)*n*n);
B = (double*)malloc(sizeof(double)*n*n);
C = (double*)malloc(sizeof(double)*n*n);
for(i=0; i<n*n; i++) { A[i] = rand()/RAND_MAX; B[i] = rand()/RAND_MAX;}
dtime = omp_get_wtime();
gemm(A,B,C, n);
dtime = omp_get_wtime() - dtime;
printf("%f\n", dtime);
dtime = omp_get_wtime();
gemm_omp(A,B,C, n);
dtime = omp_get_wtime() - dtime;
printf("%f\n", dtime);
dtime = omp_get_wtime();
gemmT(A,B,C, n);
dtime = omp_get_wtime() - dtime;
printf("%f\n", dtime);
dtime = omp_get_wtime();
gemmT_omp(A,B,C, n);
dtime = omp_get_wtime() - dtime;
printf("%f\n", dtime);
return 0;
}
这篇关于OpenMP C ++矩阵乘法并行运行较慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!