OpenMP C ++矩阵乘法并行运行较慢 [英] OpenMP C++ Matrix Multiplication run slower in parallel

查看:920
本文介绍了OpenMP C ++矩阵乘法并行运行较慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习使用OpenMP并行执行for循环的基础知识。

I'm learning the basics of paralel execution of for loop using OpenMP.

很遗憾,我的paralel程序比串行版本慢10倍。
我做错了什么?我缺少一些障碍?

Sadly, my paralel program runs 10x slower than serial version. What am I doing wrong? Am I missing some barriers?

double **basicMultiply(double **A, double **B, int size) {
   int i, j, k;
   double **res = createMatrix(size);
   omp_set_num_threads(4);
   #pragma omp parallel for private(k)
   for (i = 0; i < size; i++) {
      for (j = 0; j < size; j++) {
         for (k = 0; k < size; k++) {
            res[i][j] += A[i][k] * B[k][j];
         }
      }
   }
   return res;
}

非常感谢!

推荐答案

您的问题是由于内部循环变量 j 的竞争条件。它需要被私人化。

Your problem is due to a race condition on the inner loop variable j. It needs to be made private.

对于C89,我会这样做:

For C89 I would do something like this:

#pragma omp parallel
{
    int i, j, k;
    #pragma omp for
    for(i=0; ...

对于C ++或C99使用混合声明

For C++ or C99 use mixed declarations

#pragma omp parallel for
for(int i=0; ...

这样做不需要显式地声明任何共享或私有。

Doing this you don't have to explicitly declare anything shared or private.

对代码进一步注释当你执行 B [k] [j] 时,你的单线程代码不是缓存友好的缓存行然后移动到下一个缓存行等等,直到点积被完成,其他缓存行已被赶出,而是你应该首先转置并访问 BT [j] [k ]

Some further comments to your code. Your single threaded code is not cache friendly when you do B[k][j]. This reads a cacheline then moves to the next cache line and so forth until the dot product is done by which time the other cachelines have been evicted. Instead you should take the transpose first and access as BT[j][k]. Additionally, you have allocated arrays of arrays and not one contiguous 2D array. I fixed your code to use the transpose and a contiguous 2D array.

以下是size = 512的时候。

Here are the times I get for size=512.

no transpose  no openmp 0.94s
no transpose, openmp    0.23s
tranpose, no openmp     0.27s
transpose, openmp       0.08s

以下是代码(另见 http://coliru.stacked-crooked.com/a/ee174916fa035f97 ) )

Below is the code (also see http://coliru.stacked-crooked.com/a/ee174916fa035f97)

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void transpose(double *A, double *B, int n) {
    int i,j;
    for(i=0; i<n; i++) {
        for(j=0; j<n; j++) {
            B[j*n+i] = A[i*n+j];
        }
    }
}

void gemm(double *A, double *B, double *C, int n) 
{   
    int i, j, k;
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B[k*n+j];
            } 
            C[i*n+j ] = dot;
        }
    }
}

void gemm_omp(double *A, double *B, double *C, int n) 
{   
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B[k*n+j];
                } 
                C[i*n+j ] = dot;
            }
        }

    }
}

void gemmT(double *A, double *B, double *C, int n) 
{   
    int i, j, k;
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B2[j*n+k];
            } 
            C[i*n+j ] = dot;
        }
    }
    free(B2);
}

void gemmT_omp(double *A, double *B, double *C, int n) 
{   
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B2[j*n+k];
                } 
                C[i*n+j ] = dot;
            }
        }

    }
    free(B2);
}

int main() {
    int i, n;
    double *A, *B, *C, dtime;

    n=512;
    A = (double*)malloc(sizeof(double)*n*n);
    B = (double*)malloc(sizeof(double)*n*n);
    C = (double*)malloc(sizeof(double)*n*n);
    for(i=0; i<n*n; i++) { A[i] = rand()/RAND_MAX; B[i] = rand()/RAND_MAX;}

    dtime = omp_get_wtime();
    gemm(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemm_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    return 0;

}

这篇关于OpenMP C ++矩阵乘法并行运行较慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆