Parallelizing matrix times a vector by columns and by rows with OpenMP
For some homework I have, I need to implement the multiplication of a matrix by a vector, parallelizing it by rows and by columns. I understand the row version, but I am a little confused about the column version.
Let's say we have the following data:
And the code for the row version:
#pragma omp parallel default(none) shared(i,v2,v1,matrix,tam) private(j)
{
    #pragma omp for                      // the outer loop over rows is distributed among the threads
    for (i = 0; i < tam; i++)
        for (j = 0; j < tam; j++) {
            // printf("Hebra %d hizo %d,%d\n", omp_get_thread_num(), i, j);
            v2[i] += matrix[i][j] * v1[j];
        }
}
Here the calculations are done right and the result is correct.
The column version:
#pragma omp parallel default(none) shared(j,v2,v1,matrix,tam) private(i)
{
    for (i = 0; i < tam; i++) {          // the outer row loop runs in every thread
        #pragma omp for                  // the inner loop over columns is the work-shared one
        for (j = 0; j < tam; j++) {
            // printf("Hebra %d hizo %d,%d\n", omp_get_thread_num(), i, j);
            v2[i] += matrix[i][j] * v1[j];
        }
    }
}
Here, because of how the parallelization is done, the result varies from one execution to another depending on which thread executes each column. But something interesting happens (I would think because of compiler optimizations): if I uncomment the printf, then the results are all the same as in the row version and therefore correct, for example:
Thread 0 did 0,0
Thread 2 did 0,2
Thread 1 did 0,1
Thread 2 did 1,2
Thread 1 did 1,1
Thread 0 did 1,0
Thread 2 did 2,2
Thread 1 did 2,1
Thread 0 did 2,0
2.000000 3.000000 4.000000
3.000000 4.000000 5.000000
4.000000 5.000000 6.000000
V2:
20.000000, 26.000000, 32.000000,
The result is right, but if I remove the printf:
V2:
18.000000, 11.000000, 28.000000,
What kind of mechanism should I use to get the column version right?
Note: I care more about the explanation than about any code you may post as an answer, because what I really want is to understand what is going wrong in the column version.
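One minimal mechanism, sketched here with the same i, j, v1, v2, matrix and tam as above, is to make the conflicting update itself atomic; it is correct, but it serializes every single accumulation, so it is usually slow:

#pragma omp parallel default(none) shared(v2,v1,matrix,tam) private(i,j)
{
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            #pragma omp atomic           // the += on the shared v2[i] can no longer be interleaved
            v2[i] += matrix[i][j] * v1[j];
        }
    }
}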
EDIT
I've found a way to get rid of the private vector proposed by Z boson in his answer. I've replaced that vector with a scalar variable; here is the code:
#pragma omp parallel
{
    double sLocal = 0;                   // per-thread partial sum for the current row
    int i, j;
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            sLocal += matrix[i][j] * v1[j];
        }
        #pragma omp critical
        {
            v2[i] += sLocal;             // each thread adds its partial sum for row i
            sLocal = 0;
        }
    }
}
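For comparison, the same per-row partial sum can be expressed with an OpenMP reduction clause instead of the critical section. This is only a sketch, assuming the same tam, matrix, v1 and v2 as above:

int i, j;
for (i = 0; i < tam; i++) {
    double s = 0.0;
    // Each thread accumulates a private partial sum over its share of the
    // columns; the reduction combines them into s when the loop finishes.
    #pragma omp parallel for reduction(+:s)
    for (j = 0; j < tam; j++) {
        s += matrix[i][j] * v1[j];
    }
    v2[i] += s;                          // only the serial part writes v2[i], so there is no race
}

It is simpler, but it opens and closes a parallel region once per row, whereas the version above keeps a single region alive for the whole computation.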
I don't know exactly what your homework means by parallelizing along rows and columns, but I know why your code is not working: you have a race condition when you write to v2[i]. You can fix it by making private versions of v2[i], filling them in parallel, and then merging them with a critical section.
#pragma omp parallel
{
    float v2_private[tam] = {};          // per-thread copy of the result vector
    int i, j;
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            v2_private[i] += matrix[i][j] * v1[j];
        }
    }
    #pragma omp critical
    {
        for (i = 0; i < tam; i++) v2[i] += v2_private[i];
    }
}
I tested this. You can see the results here http://coliru.stacked-crooked.com/a/5ad4153f9579304d
Note that I did not explicitly define anything as shared or private. It's not necessary to do so. Some people think you should explicitly define everything; I personally think the opposite. By defining i and j (and v2_private) inside the parallel section, they are made private.
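For comparison, here is a sketch of the same region with every data-sharing attribute spelled out. It is purely illustrative and assumes a hypothetical fixed size TAM so that the now-private array can be declared before the region:

enum { TAM = 3 };                        // hypothetical fixed problem size, for illustration only

void matvec_columns_explicit(float matrix[TAM][TAM], float v1[TAM], float v2[TAM])
{
    int i, j;
    float v2_private[TAM];

    #pragma omp parallel default(none) shared(matrix,v1,v2) private(i,j,v2_private)
    {
        // Private copies are not zero-initialized automatically, so clear them here.
        for (i = 0; i < TAM; i++) v2_private[i] = 0.0f;

        for (i = 0; i < TAM; i++) {
            #pragma omp for
            for (j = 0; j < TAM; j++) {
                v2_private[i] += matrix[i][j] * v1[j];
            }
        }

        #pragma omp critical
        {
            for (i = 0; i < TAM; i++) v2[i] += v2_private[i];
        }
    }
}

Functionally this is the same as the snippet above; the only practical difference is that the private array has to be cleared explicitly, which the = {} initializer did implicitly when the array was declared inside the region.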