Is it possible to parallelize this for loop?

Problem description

I was given some code to parallelize using OpenMP and, among the various function calls, I noticed this for loop takes a good chunk of the computation time.

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (j <= i) {
            double s;

            s = 0;
            for (k = 0; k < j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        }
        else if (j >= i) {
            double s;

            s = 0;
            for (k = 0; k < i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}

However, after trying to parallelize it and applying some semaphores here and there (with no luck), I came to the realization that the else if branch has a strong dependency on the earlier if (L[j][i] is computed using U[i][i], which may be set in the earlier if), making it, in my opinion, non-parallelizable due to race conditions.

Is it possible to parallelize this code in such a manner that the else if is only executed once the earlier if has completed?

Answer

Before trying to parallelize things, try simplification first.

For example, the if can be eliminated completely.

Also, the code is accessing the matrices in a way that causes the worst cache performance. That may be the real bottleneck.

Note: In update #3 below, I did benchmarks, and the cache-friendly version fix5 from update #2 outperforms the original by 3.9x.

I've cleaned things up in stages, so you can see the code transformations.

With this, it should be possible to add omp directives successfully. As I mentioned in my top comment, the global vs. function scope of the variables affects the type of update that may be required (e.g. omp atomic update, etc.).

For reference, here is your original code:

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (j <= i) {
            double s;

            s = 0;
            for (k = 0; k < j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        }
        else if (j >= i) {
            double s;

            s = 0;
            for (k = 0; k < i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}

The else if (j >= i) was unnecessary and could be replaced with just else. But, we can split the j loop into two loops so that neither needs an if/else:

// fix2.c -- split up j's loop to eliminate if/else inside

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }

    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / U[i][i];
    }
}

U[i][i] is invariant in the second j loop, so we can presave it:

// fix3.c -- save off value of U[i][i]

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }

    double Uii = U[i][i];

    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / Uii;
    }
}

The access to the matrices is done in probably the worst way for cache performance. So, if the assignment of dimensions can be flipped, a substantial savings in memory access can be achieved:

// fix4.c -- transpose matrix coordinates to get _much_ better memory/cache
// performance

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[k][j] * U[i][k];
        U[i][j] = Aprime[i][j] - s;
    }

    double Uii = U[i][i];

    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[k][j] * U[i][k];
        L[i][j] = (Aprime[i][j] - s) / Uii;
    }
}

Update:

In the OP's first k-loop it's k<j, and in the 2nd it's k<i. Don't you have to fix that?

Yes, I've fixed it. It was too ugly a change for fix1.c, so I removed that and applied the changes to fix2-fix4 where it was easy to do.

Update #2:

These variables are all local to the function.

If you mean they are function scoped [without static], this says that the matrices can't be too large because, unless the code increases the stack size, they're limited to the stack size limit (e.g. 8 MB).

Although the matrices appeared to be VLAs [because n was lowercase], I ignored that. You may want to try a test case using fixed-dimension arrays, as I believe they may be faster.
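
As an aside (my illustration, not from the original answer): if the stack limit is a concern, the matrices could be heap-allocated instead, which keeps the indexing identical. A minimal C99 sketch:

#include <stdlib.h>

// Speculative sketch: put the n x n matrices on the heap so their size is
// not bounded by the stack limit; U[i][j]-style indexing stays the same.
double (*U)[n] = malloc(sizeof(double[n][n]));
double (*L)[n] = malloc(sizeof(double[n][n]));
double (*Aprime)[n] = malloc(sizeof(double[n][n]));

/* ... factorization code, unchanged ... */

free(Aprime);
free(L);
free(U);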

Also, if the matrices are function scoped and you want to parallelize things, you'd probably need to do (e.g.) #pragma omp shared(Aprime) shared(U) shared(L).

The biggest drag on the cache was the loops that calculate s. In fix4, I was able to make access to U cache-friendly, but access to L was poor.

I'd need to post a whole lot more if I did include the external context.

I guessed as much, so I did the matrix dimension swap speculatively, not knowing how much other code would need changing.

I've created a new version that changes the dimensions on L back to the original way, but keeps the swapped dimensions on the other matrices. This provides the best cache performance for all of them. That is, the inner loop for most matrix accesses increments along the cache lines.

In fact, give it a try. It may improve things to the point where parallel isn't needed. I suspect the code is memory bound anyway, so parallel might not help as much.

// fix5.c -- further transpose to fix poor performance on s calc loops
//
// flip the L dimensions back to original

double U[n][n];
double L[n][n];
double Aprime[n][n];

double *Up;
double *Lp;
double *Ap;

for (i = 0; i < n; i++) {
    Ap = Aprime[i];
    Up = U[i];

    for (j = 0; j <= i; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < j; k++)
            s += Lp[k] * Up[k];
        Up[j] = Ap[j] - s;
    }

    double Uii = Up[i];

    for (; j < n; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < i; k++)
            s += Lp[k] * Up[k];
        Lp[i] = (Ap[j] - s) / Uii;
    }
}
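
For the parallelization itself, here is a speculative sketch (my addition, not from the original answer) of where an OpenMP directive could go in fix5. The first j loop must stay serial because Up[j] depends on Up[k] for k < j (a loop-carried dependency), but once Uii is known, the iterations of the second j loop are independent of one another:

// Speculative sketch: parallelize only the second j loop of fix5.
for (i = 0; i < n; i++) {
    Ap = Aprime[i];
    Up = U[i];

    // serial: Up[j] depends on Up[k] for k < j
    for (j = 0; j <= i; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < j; k++)
            s += Lp[k] * Up[k];
        Up[j] = Ap[j] - s;
    }

    double Uii = Up[i];

    // independent: each iteration writes only L[j2][i] and reads
    // values that are already final
    #pragma omp parallel for
    for (int j2 = i + 1; j2 < n; j2++) {
        double s = 0;
        double *Lp2 = L[j2];
        for (int k2 = 0; k2 < i; k2++)
            s += Lp2[k2] * Up[k2];
        Lp2[i] = (Ap[j2] - s) / Uii;
    }
}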

Even if you really need the original dimensions, depending upon the other code, you might be able to transpose going in and transpose back going out. This would keep things the same for other code, but, if this code is truly a bottleneck, the extra transpose operations might be small enough to merit this.
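
A minimal sketch of that transpose-in/transpose-out idea (my illustration; the helper is hypothetical):

// Hypothetical helper: out-of-place transpose of an n x n matrix, so the
// rest of the program can keep the original orientation while this routine
// works on the cache-friendly one.
static void transpose(int n, double dst[n][n], double src[n][n])
{
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            dst[c][r] = src[r][c];
}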

Update #3:

I've run benchmarks on all the versions. Here are the elapsed times and the ratios relative to the original for n equal to 1037:

orig: 1.780916929 1.000x
fix1: 3.730602026 0.477x
fix2: 1.743769884 1.021x
fix3: 1.765769482 1.009x
fix4: 1.762100697 1.011x
fix5: 0.452481270 3.936x

Higher ratios are better.
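
The answer doesn't show its timing harness; a minimal sketch of how such elapsed times could be measured (my illustration, assuming a POSIX system):

#include <stdio.h>
#include <time.h>

// Speculative timing sketch: wall-clock seconds via the monotonic clock.
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// usage:
//   double t0 = now_sec();
//   /* run one version of the factorization */
//   printf("elapsed: %.9f\n", now_sec() - t0);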

Anyway, this is the limit of what I can do. So, good luck ...
