OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?


Problem Description

I'm currently implementing a dynamic programming algorithm for solving knapsack problems. Therefore my code has two for-loops, an outer and an inner loop.

From a logical point of view I can parallelize the inner for-loop, since the calculations there are independent of each other. The outer for-loop cannot be parallelized because of its dependencies.

So this was my first approach:

for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp parallel for if(weightColumns > THRESHOLD)
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
}

The code works well, and the algorithm solves the problems correctly. Then I thought about optimizing it, since I was not sure how OpenMP's thread management works. I wanted to prevent unnecessary initialization of the threads during each iteration, so I put an outer parallel block around the outer loop.

Second approach:

#pragma omp parallel if(weightColumns > THRESHOLD)
{
    for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp for
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
     }
}

This has an unwanted side effect: everything inside the parallel block is now executed n times, where n is the number of available cores. I already tried to use the single and critical pragmas to force the outer for-loop to be executed in one thread, but then the inner loop cannot be calculated by multiple threads unless I open a new parallel block (and then there is no gain in speed). But never mind, because the good thing is: this does not affect the result. The problems are still solved correctly.
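For illustration, here is a sketch of what such a single-based attempt might look like (a reconstructed, hypothetical sketch, not the exact code from my tests). The inner loop cannot be distributed from there, because OpenMP forbids a for worksharing construct closely nested inside a single region of the same parallel region:

#pragma omp parallel if(weightColumns > THRESHOLD)
{
    #pragma omp single
    for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        /* A "#pragma omp for" here would be a worksharing region
           closely nested inside a single region, which OpenMP does
           not allow -- so this inner loop runs in one thread only. */
        for(int c=1; c < weightColumns; c++){
            /* ... same body as in the approaches above ... */
        }
    }
}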

NOW THE STRANGE THING: The second approach is FASTER than the first one!

How can this be? I mean, even though the outer for-loop is executed n times (in parallel) and the inner for-loop is distributed among the same n cores n times, it is faster than the first approach, which executes the outer loop only once and distributes the workload of the inner for-loop evenly.

At first I was thinking: "well yeah, it's probably because of thread management", but then I read that OpenMP pools the instantiated threads, which speaks against my assumption. Then I disabled compiler optimization (compiler flag -O0) to check whether it had anything to do with that. But this did not affect the measurement.

Can any of you shed more light on this, please?

The measured times for solving the knapsack problem containing 7500 items with a max capacity of 45000 (creating a matrix of 7500x45000, which is well above the THRESHOLD variable used in the code):

  • Approach 1: ~0.88s
  • Approach 2: ~0.52s

Thanks in advance,

phineliner

EDIT:

Measurement of a more complex problem: added 2500 items to the problem (from 7500 to 10000; more complex problems can't currently be handled because of memory reasons).

  • Approach 1: ~1.19s
  • Approach 2: ~0.71s

EDIT2: I was mistaken about the compiler optimization: it does not affect the measurement. At least I cannot reproduce the difference I measured before. I edited the question text accordingly.

Solution

Let's first consider what your code is doing. Essentially, your code transforms a matrix (2D array) where the values of each row depend on the previous row, but the values of the columns are independent of each other. Let me choose a simpler example of this:

for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

One way to parallelize this is to swap the loops:

Method 1:

#pragma omp parallel for
for(int j=0; j<n; j++) {
    for(int i=1; i<n; i++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

With this method each thread runs all n-1 iterations of i of the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly: for a fixed j, consecutive accesses a[i*n+j] are a whole row (stride n) apart in memory, so nearly every access touches a new cache line.

Another possibility is to only parallelize the inner loop.

Method 2:

for(int i=1; i<n; i++) {
    #pragma omp parallel for 
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This essentially processes the columns within a single row in parallel, but each row sequentially. The iterations over i are run only by the master thread.

Another way to process the columns in parallel but each row sequentially is:

Method 3:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In this method, like method 1, each thread runs over all n-1 iterations of i. However, this method has an implicit barrier after the inner loop, which causes each thread to pause until all threads have finished a row, making this method sequential for each row, like method 2.

The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause.

Method 4:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In my tests the nowait clause did not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
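A note the answer leaves implicit (my own addition, not from the original): dropping the barrier is only safe here because a static schedule assigns each thread the same j-range for every value of i, so each thread only reads values of row i-1 that it wrote itself in the previous outer iteration. Spelling the schedule out makes that assumption visible:

#pragma omp parallel
for(int i=1; i<n; i++) {
    // schedule(static) guarantees the same iteration-to-thread
    // assignment for every i, so each thread reads only the
    // a[(i-1)*n+j] values it wrote itself -- which is what makes
    // omitting the barrier (nowait) correct.
    #pragma omp for schedule(static) nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}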

Here are the times in seconds for n=3000 on my four-core IVB system with GCC 4.9.2:

method 1: 3.00
method 2: 0.26 
method 3: 0.21
method 4: 0.21

This test is probably memory-bandwidth bound, so I could have chosen a better case using more computation, but nevertheless the differences are significant enough. In order to remove a bias due to creating the thread pool, I first ran one of the methods without timing it.
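For reference, a minimal sketch of such a measurement (my own illustration; the harness, the matrix setup, and the choice of method 3 as the warm-up target are assumptions, not the actual benchmark code):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* Method 3 from above: parallel region outside, worksharing "for" inside. */
static void method3(double *a, int n) {
    #pragma omp parallel
    for (int i = 1; i < n; i++) {
        #pragma omp for
        for (int j = 0; j < n; j++)
            a[i*n + j] += a[(i-1)*n + j];
    }
}

int main(void) {
    int n = 3000;
    double *a = malloc(sizeof *a * n * n);
    for (int i = 0; i < n * n; i++) a[i] = 1.0;

    method3(a, n);               /* untimed warm-up: pays thread-pool creation once */

    double t0 = omp_get_wtime();
    method3(a, n);               /* timed run */
    printf("method 3: %.2f s\n", omp_get_wtime() - t0);

    free(a);
    return 0;
}

Compiled with something like gcc -fopenmp -O2.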

It's clear from the timings how cache-unfriendly method 1 is. It's also clear that method 3 is faster than method 2 and that nowait has little effect in this case.

Since method 2 and method 3 both process the columns within a row in parallel but the rows sequentially, one might expect their timings to be the same. So why do they differ? Let me make some observations:

  1. Due to the thread pool, the threads are not created and destroyed for each iteration of the outer loop of method 2, so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool; this is something each compiler implements.

  2. The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i, whereas in method 3 each thread processes a private i. But this seems too trivial to explain the significant difference between the methods, because the implicit barrier in method 3 causes them to sync anyway, and processing i is just a matter of an increment and a conditional test.

  3. The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, says that the extra overhead in method 2 is all in leaving and entering a parallel region for each iteration of i.

So my conclusion is that explaining why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC, which uses pthreads, this could probably be explained by creating a toy model of a thread pool, but I don't have enough experience with that yet.
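One cheap way to probe that overhead (my own sketch, not part of the original answer) is to time an empty parallel region in a loop: its cost is pure fork/join plus pool synchronization, which is what method 2 pays once per row:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int iters = 100000;

    #pragma omp parallel         /* warm-up: create the thread pool before timing */
    { }

    double t0 = omp_get_wtime();
    for (int i = 0; i < iters; i++) {
        #pragma omp parallel
        { }                      /* empty region: measures fork/join cost only */
    }
    double dt = omp_get_wtime() - t0;
    printf("fork/join: %.2f us per parallel region\n", 1e6 * dt / iters);
    return 0;
}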
