Parallelizing matrix times a vector by columns and by rows with OpenMP


Problem description


For some homework I have, I need to implement the multiplication of a matrix by a vector, parallelizing it by rows and by columns. I understand the row version, but I am a little confused about the column version.

Let's say we have the following data:
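(The data itself appeared as an image in the original post and is missing here. For reference, a set of declarations consistent with the matrix and the result printed further down might look like the following; the element type and the choice of v1 are assumptions, since only the matrix and the expected output V2 = 20, 26, 32 are shown.)

#define TAM 3                        /* assumed size; the printed matrix is 3x3 */

float matrix[TAM][TAM] = {           /* values copied from the matrix printed below */
    {2.0f, 3.0f, 4.0f},
    {3.0f, 4.0f, 5.0f},
    {4.0f, 5.0f, 6.0f}
};
float v1[TAM] = {1.0f, 2.0f, 3.0f};  /* assumed: one vector that reproduces V2 = 20, 26, 32 */
float v2[TAM] = {0.0f, 0.0f, 0.0f};  /* result vector, initialized to zero */
int   tam     = TAM;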

And the code for the row version:

#pragma omp parallel default(none) shared(i,v2,v1,matrix,tam) private(j)
  {
#pragma omp for
    for (i = 0; i < tam; i++)
      for (j = 0; j < tam; j++){
//        printf("Hebra %d hizo %d,%d\n", omp_get_thread_num(), i, j);
        v2[i] += matrix[i][j] * v1[j];
      }
  }

Here the calculations are done right and the result is correct.

The column version:

#pragma omp parallel default(none) shared(j,v2,v1,matrix,tam) private(i)
  {
    for (i = 0; i < tam; i++)
#pragma omp for
      for (j = 0; j < tam; j++) {
//            printf("Hebra %d hizo %d,%d\n", omp_get_thread_num(), i, j);
        v2[i] += matrix[i][j] * v1[j];
      }
  }

Here, due to how the parallelization is done, the result varies on each execution, depending on which thread executes each column. But something interesting happens (and I would think it is because of compiler optimizations): if I uncomment the printf, the results are all the same as in the row version and therefore correct, for example:

Thread 0 did 0,0
Thread 2 did 0,2
Thread 1 did 0,1
Thread 2 did 1,2
Thread 1 did 1,1
Thread 0 did 1,0
Thread 2 did 2,2
Thread 1 did 2,1
Thread 0 did 2,0

 2.000000  3.000000  4.000000 
 3.000000  4.000000  5.000000 
 4.000000  5.000000  6.000000 


V2:
20.000000, 26.000000, 32.000000,

That is right, but if I remove the printf:

V2:
18.000000, 11.000000, 28.000000,

What kind of mechanism should I use to get the column version right?

Note: I care more about the explanation than about the code you may post as an answer, because what I really want is to understand what is going wrong in the column version.

EDIT

I've found a way to get rid of the private vector proposed by Z boson in his answer. I've replaced that vector with a variable; here is the code:

    #pragma omp parallel
      {
        double sLocal = 0;
        int i, j;
        for (i = 0; i < tam; i++) {
    #pragma omp for
          for (j = 0; j < tam; j++) {
            sLocal += matrix[i][j] * v1[j];
          }
    #pragma omp critical
          {
            v2[i] += sLocal;
            sLocal = 0;
          }
        }
      }
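
For comparison, here is a sketch (not from the original post) of how the same per-row accumulation could be expressed with OpenMP's reduction clause instead of the manual critical section. The reduction target of a worksharing for must be shared, which is why the reset and the final store go through single constructs:

double sum;                          /* shared reduction target for the current row */
#pragma omp parallel
{
    int i, j;
    for (i = 0; i < tam; i++) {
#pragma omp single
        sum = 0.0;                   /* one thread resets; implicit barrier follows */
#pragma omp for reduction(+: sum)
        for (j = 0; j < tam; j++)
            sum += matrix[i][j] * v1[j];
        /* implicit barrier: sum now holds the full dot product of row i */
#pragma omp single
        v2[i] += sum;                /* one thread stores the row result; implicit barrier follows */
    }
}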

Solution

I don't know exactly what your homework means by parallelizing along rows and columns, but I know why your code is not working. You have a race condition when you write to v2[i]: for a fixed row i, the threads handle different columns j but they all update the same v2[i], so updates can be lost. You can fix it by making private versions of v2[i], filling them in parallel, and then merging them with a critical section.

#pragma omp parallel
{
    float v2_private[tam];                          /* per-thread private copy of v2 */
    int i, j;
    for (i = 0; i < tam; i++) v2_private[i] = 0;    /* zero it explicitly: a variable-length array cannot take an initializer */
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            v2_private[i] += matrix[i][j] * v1[j];
        }
    }
    #pragma omp critical
    {
        for (i = 0; i < tam; i++) v2[i] += v2_private[i];   /* merge the per-thread results */
    }
}

I tested this. You can see the results here http://coliru.stacked-crooked.com/a/5ad4153f9579304d

Note that I did not explicitly define anything as shared or private. It's not necessary to do so. Some people think you should explicitly define everything; I personally think the opposite. By defining i and j (and v2_private) inside the parallel section, they are made private.
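
As a further sketch (an addition, not part of the answer): with OpenMP 4.5 or later, the private array plus critical merge can be expressed directly as an array-section reduction, which creates the zero-initialized per-thread copies of v2 and combines them into the shared v2 automatically:

#pragma omp parallel reduction(+: v2[:tam])
{
    int i, j;                        /* private: declared inside the parallel region */
    for (i = 0; i < tam; i++) {
#pragma omp for
        for (j = 0; j < tam; j++)
            v2[i] += matrix[i][j] * v1[j];   /* accumulates into this thread's private copy of v2 */
    }
}

The idea is the same as in the code above; the runtime just does the bookkeeping that the explicit v2_private array and the critical section do by hand.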
