How can I use OpenMP and AVX2 together and still get the correct answer?


Question

I wrote a matrix-vector product program using OpenMP and AVX2.

However, I get the wrong answer when OpenMP is enabled. The correct result is that every element of array c equals 100.

Instead, my output is a mix of 98, 99, and 100.

The actual code is below.

I compiled it with Clang using the flags -fopenmp, -mavx, and -mfma.

#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "omp.h"
#include "x86intrin.h"

void mv(double *a,double *b,double *c, int m, int n, int l)
{
    int k;
#pragma omp parallel
    {
        __m256d va,vb,vc;
        int i;
#pragma omp for private(i, va, vb, vc) schedule(static)
        for (k = 0; k < l; k++) {
            vb = _mm256_broadcast_sd(&b[k]);
            for (i = 0; i < m; i+=4) {
                va = _mm256_loadu_pd(&a[m*k+i]);
                vc = _mm256_loadu_pd(&c[i]);

                vc = _mm256_fmadd_pd(vc, va, vb);

                _mm256_storeu_pd( &c[i], vc );
            }
        }
    }
}
int main(int argc, char* argv[]) {

    // set variables
    int m;
    double* a;
    double* b;
    double* c;
    int i;

    m=100;
    // main program

    // set vector or matrix
    a=(double *)malloc(sizeof(double) * m*m);
    b=(double *)malloc(sizeof(double) * m*1);
    c=(double *)malloc(sizeof(double) * m*1);
    //preset
    for (i=0;i<m;i++) {
        a[i]=1;
        b[i]=1;
        c[i]=0.0;
    }
    for (i=m;i<m*m;i++) {
        a[i]=1;
    }

    mv(a, b, c, m, 1, m);

    for (i=0;i<m;i++) {
        printf("%e\n", c[i]);
    }
    free(a);
    free(b);
    free(c);
    return 0;
}

I know a critical section would help, but a critical section is slow.

So, how can I solve this problem?

Recommended answer

The basic operation you want, accumulated over k, is

c[i] += a[i,k]*b[k]

If you use row-major order storage this becomes

c[i] += a[i*l + k]*b[k]

If you use column-major order storage this becomes

c[i] += a[k*m + i]*b[k]

For row-major order you can parallelize like this:

#pragma omp parallel for
for(int i=0; i<m; i++) {
  for(int k=0; k<l; k++) {
    c[i] += a[i*l+k]*b[k];
  }
}
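
The row-major loop above is race-free because each iteration of the parallel loop writes only its own c[i]. As a minimal sketch of the same idea, the variant below accumulates each row in a local variable and lets the compiler vectorize the inner dot product. The function name mv_rowmajor and the use of `omp simd` are assumptions for illustration, not part of the original answer; the reduction may also sum in a different order than a plain serial loop, so floating-point results can differ in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: c[i] += sum_k a[i*l + k] * b[k],
 * with a stored row-major (m rows, l columns). */
void mv_rowmajor(const double *a, const double *b, double *c, int m, int l)
{
    assert(m >= 0 && l >= 0);
    /* Rows are split across threads; no two threads touch the same c[i]. */
#pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;                       /* local to this iteration */
        const double *row = &a[(size_t)i * l];  /* start of row i */
#pragma omp simd reduction(+:sum)
        for (int k = 0; k < l; k++)
            sum += row[k] * b[k];
        c[i] += sum;                            /* single write per row */
    }
}
```

If the file is compiled without -fopenmp, the pragmas are ignored and the function simply runs serially.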

For column-major order you can parallelize like this:

#pragma omp parallel
for(int k=0; k<l; k++) {
  #pragma omp for
  for(int i=0; i<m; i++) {
    c[i] += a[k*m+i]*b[k];
  }
}
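
Applied to the question's intrinsics code, one possible sketch is shown below. It is an alternative to the loop order above rather than the answer's own method: it makes i the outer, parallel loop in blocks of four, so each thread keeps its slice of c in a register and no two threads store to the same element. The function name mv_colmajor, the requirement that m be a multiple of 4, and the scalar fallback are assumptions for illustration. Note that `_mm256_fmadd_pd(x, y, z)` computes x*y + z, so the accumulator goes last; the question's call passes it first, which still yields 100 only because every input is 1. The strided reads of a are not tuned for cache behavior.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical helper: c[i] += sum_k a[k*m + i] * b[k],
 * with a stored column-major (m rows, l columns).
 * Assumes m is a multiple of 4, as in the question (m = 100). */
void mv_colmajor(const double *a, const double *b, double *c, int m, int l)
{
    assert(m % 4 == 0);
    /* Each iteration owns c[i..i+3], so writes never overlap across threads. */
#pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i += 4) {
#if defined(__AVX__) && defined(__FMA__)
        __m256d vc = _mm256_loadu_pd(&c[i]);
        for (int k = 0; k < l; k++) {
            __m256d va = _mm256_loadu_pd(&a[(size_t)k * m + i]);
            __m256d vb = _mm256_broadcast_sd(&b[k]);
            vc = _mm256_fmadd_pd(va, vb, vc);   /* vc = va * vb + vc */
        }
        _mm256_storeu_pd(&c[i], vc);
#else
        /* Scalar fallback when AVX/FMA are not enabled at compile time. */
        for (int k = 0; k < l; k++)
            for (int j = 0; j < 4; j++)
                c[i + j] += a[(size_t)k * m + i + j] * b[k];
#endif
    }
}
```

With the question's flags (-fopenmp -mavx -mfma) the intrinsic path is compiled; calling mv_colmajor(a, b, c, m, m) in place of mv would then be expected to fill c with 100 for the all-ones input.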

Matrix-vector operations are Level 2 BLAS operations, which are memory-bandwidth bound. Level 1 and Level 2 operations do not scale with the number of cores, since all cores compete for the same memory bandwidth. Only Level 3 operations (e.g. dense matrix multiplication) scale: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_3
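
A rough way to see the difference is to compare floating-point operations per byte of memory traffic. The sketch below is a back-of-envelope estimate only: it assumes 8-byte doubles, that every operand is moved between memory and the core exactly once (ideal cache reuse), and the function names are hypothetical.

```c
#include <assert.h>

/* Flops per byte for c += A*b with an m-by-l matrix of doubles:
 * 2*m*l flops; traffic = A read once, b read once, c read and written once. */
double matvec_intensity(double m, double l)
{
    assert(m > 0 && l > 0);
    double flops = 2.0 * m * l;
    double bytes = 8.0 * (m * l + l + 2.0 * m);
    return flops / bytes;
}

/* Same estimate for C = A*B with n-by-n matrices:
 * 2*n^3 flops; traffic = A and B read once, C written once. */
double matmul_intensity(double n)
{
    assert(n > 0);
    return (2.0 * n * n * n) / (8.0 * 3.0 * n * n);
}
```

Under these assumptions the matrix-vector product stays below 0.25 flops per byte no matter how large the matrix is, while the matrix-matrix product reaches n/12 flops per byte, so only the latter has enough arithmetic per byte to keep additional cores busy.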
