OpenMP drastically slows down for loop

Problem description

I am attempting to speed up this for loop with OpenMP parallelization. I was under the impression that this should split up the work across a number of threads. However, perhaps the overhead is too large for this to give me any speedup.

I should mention that this loop occurs many many many times, and each instance of the loop should be parallelized. The number of loop iterations, newNx, can be as small as 3 or as large as 256. However, if I conditionally have it parallelized only for newNx > 100 (only the largest loops), it still slows down significantly.

Is there anything in here which would cause this to be slower than anticipated? I should also mention that the vectors A,v,b are VERY large, but access is O(1) I believe.

    #pragma omp parallel for private(j,k),shared(A,v,b)
    for(i=1;i<=newNx;i+=2) {
      for(j=1;j<=newNy;j++) { 
        for(k=1;k<=newNz;k+=1) {

          nynz=newNy*newNz; 

          v[(i-1)*nynz+(j-1)*newNz+k] = 
          -(v[(i-1)*nynz+(j-1)*newNz+k+1 - 2*(k/newNz)]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kup+offA] + 
          v[(i-1)*nynz+(j-1)*newNz+ k-1+2*(1/k)]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kdo+offA] + 
          v[(i-1)*nynz+(j - 2*(j/newNy))*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jup+offA] + 
          v[(i-1)*nynz+(j-2 + 2*(1/j))*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jdo+offA] + 
          v[(i - 2*(i/newNx))*nynz+(j-1)*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + iup+offA] + 
          v[(i-2 + 2*(1/i))*nynz+(j-1)*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ido+offA] - 
          b[(i-1)*nynz + (j-1)*newNz + k])
          /A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ifi+offA];
        }
      }
    }

Solution

Assuming you don't have a race condition, you can try fusing the loops. Fusing gives larger chunks to parallelize, which should help reduce the effect of false sharing and will likely distribute the load better as well.

For a triple loop like this

for(int i2=0; i2<x; i2++) {
    for(int j2=0; j2<y; j2++) {
        for(int k2=0; k2<z; k2++) {
            //
        }
    }
}

you can fuse it like this

#pragma omp parallel for
for(int n=0; n<(x*y*z); n++) {
    int i2 = n/(y*z);
    int j2 = (n%(y*z))/z;
    int k2 = (n%(y*z))%z;
    //
}
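
To convince yourself that the index arithmetic above is right, here is a minimal sketch that walks the nested loop and checks that the fused index decomposes to the same (i2, j2, k2) at every step. The function names `unfuse` and `fused_matches_nested` are made up for this illustration; they are not part of OpenMP or of the original code.

```c
/* Decompose a fused index n into (i2, j2, k2) for inner extents y and z,
   using the same arithmetic as the fused loop. */
static void unfuse(int n, int y, int z, int *i2, int *j2, int *k2) {
    *i2 = n / (y * z);
    *j2 = (n % (y * z)) / z;
    *k2 = (n % (y * z)) % z;
}

/* Return 1 if the fused loop visits the same triples, in the same order,
   as the nested triple loop with extents x, y, z; return 0 otherwise. */
int fused_matches_nested(int x, int y, int z) {
    int n = 0;
    for (int i2 = 0; i2 < x; i2++) {
        for (int j2 = 0; j2 < y; j2++) {
            for (int k2 = 0; k2 < z; k2++) {
                int fi, fj, fk;
                unfuse(n, y, z, &fi, &fj, &fk);
                if (fi != i2 || fj != j2 || fk != k2) {
                    return 0;
                }
                n++;
            }
        }
    }
    /* The fused loop must also have exactly x*y*z iterations. */
    return n == x * y * z;
}
```

Calling `fused_matches_nested(3, 4, 5)` runs the check for a 3 x 4 x 5 iteration space. Note that the fused form pays for two divisions and two modulo operations per iteration, so it only helps when the loop body is heavy enough to hide that cost.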

In your case you can do it like this

int i, j, k, n;
int x = newNx%2 ? newNx/2+1 : newNx/2;
int y = newNy;
int z = newNz;

#pragma omp parallel for private(i, j, k)
for(n=0; n<(x*y*z); n++) {
    i = 2*(n/(y*z)) + 1;
    j = (n%(y*z))/z + 1;
    k = (n%(y*z))%z + 1;
    // rest of code
}
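
As a sanity check on the 1-based, stride-2 mapping, and as an alternative worth trying, here is a sketch that compares the manual fusing with OpenMP's `collapse(3)` clause (available since OpenMP 3.0), which asks the compiler to flatten the nest for you. This is a swapped-in technique, not what the original code uses. The helper `visit` is a hypothetical stand-in for the real loop body: it only counts visits, and every iteration writes a distinct element, so there is no race. Whether `collapse` performs as well as the manual version is an assumption you would have to measure.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the loop body: count one visit to (i, j, k). */
static void visit(int *count, int newNy, int newNz, int i, int j, int k) {
    count[((i - 1) * newNy + (j - 1)) * newNz + (k - 1)] += 1;
}

/* Manually fused loop, same index arithmetic as in the answer. */
void mark_fused(int *count, int newNx, int newNy, int newNz) {
    int x = (newNx + 1) / 2;  /* number of odd i in 1..newNx */
    int y = newNy;
    int z = newNz;
    #pragma omp parallel for
    for (int n = 0; n < x * y * z; n++) {
        int i = 2 * (n / (y * z)) + 1;
        int j = (n % (y * z)) / z + 1;
        int k = (n % (y * z)) % z + 1;
        visit(count, newNy, newNz, i, j, k);
    }
}

/* The same iteration space, flattened by the compiler via collapse(3). */
void mark_collapsed(int *count, int newNx, int newNy, int newNz) {
    #pragma omp parallel for collapse(3)
    for (int i = 1; i <= newNx; i += 2) {
        for (int j = 1; j <= newNy; j++) {
            for (int k = 1; k <= newNz; k++) {
                visit(count, newNy, newNz, i, j, k);
            }
        }
    }
}

/* Return 1 if both versions visit every odd-i point exactly once and
   never touch an even-i point; return 0 otherwise. */
int same_coverage(int newNx, int newNy, int newNz) {
    size_t plane = (size_t)newNy * (size_t)newNz;
    size_t total = (size_t)newNx * plane;
    int *a = calloc(total, sizeof *a);
    int *b = calloc(total, sizeof *b);
    int ok = (a != NULL && b != NULL);
    if (ok) {
        mark_fused(a, newNx, newNy, newNz);
        mark_collapsed(b, newNx, newNy, newNz);
        for (size_t t = 0; t < total; t++) {
            int i = (int)(t / plane) + 1;
            int expected = (i % 2 == 1) ? 1 : 0;
            if (a[t] != expected || b[t] != expected) {
                ok = 0;
            }
        }
    }
    free(a);
    free(b);
    return ok;
}
```

Compiled without OpenMP support, the pragmas are ignored and both functions run serially, so the check still works. With OpenMP enabled (for example `-fopenmp` on GCC), timing `mark_fused` against `mark_collapsed` on your real body would show whether the hand-written index arithmetic buys anything.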

If this successfully speeds up your code, then you can feel good that you made your code faster and, at the same time, obfuscated it even further.
