Using OpenMP "for simd" in matrix-vector multiplication?


Question

I'm trying to make my matrix-vector multiplication function competitive with BLAS by combining #pragma omp for with #pragma omp simd, but I get no speedup over using the for construct alone. How do I properly vectorize the inner loop with OpenMP's SIMD construct?

vector dot(const matrix& A, const vector& x)
{
  assert(A.shape(1) == x.size());

  vector y = xt::zeros<double>({A.shape(0)});

  int i, j;
#pragma omp parallel shared(A, x, y) private(i, j)
  {
#pragma omp for // schedule(static)
    for (i = 0; i < y.size(); i++) { // row major
#pragma omp simd
      for (j = 0; j < x.size(); j++) {
        y(i) += A(i, j) * x(j);
      }
    }
  }

  return y;
}

Recommended answer

Your directive is incorrect because it introduces a race condition on y(i): every iteration of the simd loop updates the same element. You should use a reduction in this case. Here is an example:

vector dot(const matrix& A, const vector& x)
{
  assert(A.shape(1) == x.size());

  vector y = xt::zeros<double>({A.shape(0)});

  int i, j;

  #pragma omp parallel shared(A, x, y) private(i, j)
  {
    #pragma omp for // schedule(static)
    for (i = 0; i < y.size(); i++) { // row major
      double sum = 0.0; // plain value; decltype(y(0)) can deduce a reference type on a non-const container

      #pragma omp simd reduction(+:sum)
      for (j = 0; j < x.size(); j++) {
        sum += A(i, j) * x(j);
      }

      y(i) += sum;
    }
  }

  return y;
}
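
For readers without xtensor, the same pattern can be sketched with plain std::vector. This is a minimal illustration and not the original poster's code: the function name matvec, the flat row-major layout, and the explicit rows parameter are all assumptions made for the example. It is meant to be built with OpenMP enabled (for instance -fopenmp on GCC or Clang); without that flag the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): computes y = A * x,
// where A is a rows x cols matrix stored row-major in a flat vector.
std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x,
                           std::size_t rows)
{
  const std::size_t cols = x.size();
  assert(A.size() == rows * cols);

  std::vector<double> y(rows, 0.0);
  const double* a = A.data();
  const double* xv = x.data();
  double* yv = y.data();

  // Rows are independent of each other, so they are split across threads.
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(rows); ++i) {
    const double* row = a + i * cols;
    double sum = 0.0; // declared inside the loop, so private to each iteration

    // The dot product of one row is a reduction: each SIMD lane keeps a
    // partial sum and the lanes are combined after the loop.
    #pragma omp simd reduction(+:sum)
    for (std::size_t j = 0; j < cols; ++j) {
      sum += row[j] * xv[j];
    }

    yv[i] = sum; // a single write per row, no shared accumulator
  }

  return y;
}
```

For example, calling matvec with the 2 x 3 matrix {1, 2, 3, 4, 5, 6}, x = {1, 0, -1} and rows = 2 yields y = {-2, -2}. Declaring sum inside the outer loop avoids the need for private clauses, and writing y once per row keeps the inner loop free of stores to shared memory.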

Note that this is not necessarily faster, because some compilers (ICC, for example) can vectorize the code automatically. GCC and Clang often fail to vectorize (advanced) SIMD reductions on their own, so such a directive helps them a bit. You can inspect the assembly code to see how the loop was vectorized, or enable the compiler's vectorization reports (available in GCC, for example).
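
One way to check this is to isolate the inner loop in a small function and compile it with reporting enabled. The sketch below is only an illustration: the name dot_row and the file name in the comments are placeholders, and the flags shown are the ones I would try first on recent GCC and Clang versions. One likely reason the directive matters is that vectorizing a floating-point sum reorders the additions, which these compilers generally do not do by default unless something like -ffast-math or an explicit reduction clause permits it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel for inspecting vectorization (name is illustrative).
//
// Possible ways to get a report (dot_row.cpp is a placeholder file name):
//   g++     -O3 -fopenmp-simd -fopt-info-vec-optimized -fopt-info-vec-missed -c dot_row.cpp
//   clang++ -O3 -fopenmp-simd -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c dot_row.cpp
//
// To read the assembly instead, replace -c with -S and look for packed
// instructions in the loop body (on x86-64, e.g. mulpd/addpd or their
// AVX/FMA forms, depending on the -march setting).
double dot_row(const double* row, const double* x, std::size_t n)
{
  assert(n == 0 || (row != nullptr && x != nullptr));

  double sum = 0.0;

  // The reduction clause tells the compiler that reordering the
  // additions across SIMD lanes is acceptable.
  #pragma omp simd reduction(+:sum)
  for (std::size_t j = 0; j < n; ++j) {
    sum += row[j] * x[j];
  }

  return sum;
}
```

The -fopenmp-simd flag enables only the simd directives without linking the OpenMP runtime, which is convenient when you want to study the inner loop by itself. Comparing the report with and without the pragma shows whether the directive actually changes anything for your compiler.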
