嵌套循环,内部循环并行化,重用线程 [英] nested loops, inner loop parallelization, reusing threads

查看:247
本文介绍了嵌套循环,内部循环并行化,重用线程的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

免责声明:以下示例只是一个快速理解问题的虚拟示例.如果您正在考虑现实世界中的问题,请考虑任何动态编程.

Disclaimer: following example is just an dummy example to quickly understand the problem. If you are thinking about real world problem, think anything dynamic programming.

问题: 我们有一个n * m矩阵,我们想要从上一行复制元素,如以下代码所示:

The problem: We have an n*m matrix, and we want to copy elements from previous row as in the following code:

for (i = 1; i < n; i++)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];

方法: 外循环迭代必须按顺序执行,它们将顺序执行. 内环可以并行化.我们希望将创建和杀死线程的开销降至最低,因此我们只想创建一次线程组,但是,这在OpenMP中似乎是不可能完成的任务.

Approach: Outer loop iterations have to be executed in order, they would be executed sequentially. Inner loop can be parallelized. We would want to minimize overhead of creating and killing threads, so we would want to create team of threads just once, however, this seems like an impossible task in OpenMP.

#pragma omp parallel private(j)
{
   for (i = 1; i < n; i++)
   {   
      #pragma omp for scheduled(dynamic)
      for (j = 0; j < m; j++)
         x[i][j] = x[i-1][j];
   }
}

在外循环上应用ordered选项时,代码将按顺序执行,因此不会提高性能. 即使必须使用某些解决方法,我仍在寻找上述方案的解决方案.

When we apply ordered option on the outer loop, the code will be executed sequential way, so there will be no performance gain. I am looking to solution for the scenario above, even if I had to use some workaround.

我正在添加我的实际代码.这实际上比seq慢.版本.请查看:

I am adding my actual code. This is is actually slower than seq. version. Please review:

/* load input */
for (i = 1; i <= n; i++)
    scanf ("%d %d", &in[i][W], &in[i][V]);

/* init */
for (i = 0; i <= wc; i++)
    a[0][i] = 0;

/* compute */
#pragma omp parallel private(i,w)
{
    for(i = 1; i <= n; ++i) // 1 000 000
    {
        j=i%2;
        jn = j == 1 ? 0 : 1;

        #pragma omp for
        for(w = 0; w <= in[i][W]; w++) // 1000
            a[j][w] = a[jn][w];

        #pragma omp for
        for(w = in[i][W]+1; w <= wc; w++) // 350 000
            a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
    }
}

对于测量,我使用的是这样的东西:

As for measuring, I am using something like this:

double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;

推荐答案

对于这种特殊情况,总结一下OpenMP中的并行化:不值得.

To sum up the parallelization in OpenMP for this particular case: It is not worth it.

为什么? 内部循环中的操作很简单.代码是用-O3编译的,因此max()调用可能已替换为函数体的代码. 隐式屏障的开销可能足够高,以补偿性能增益,而总体开销又足够高,以致并行代码甚至比顺序代码还要慢. 我还发现,这种构造并没有真正的性能提升:

Why? Operations in the inner loops are simple. Code was compiled with -O3, so max() call was probably substituted with the code of function body. Overhead in implicit barrier is probably high enough, to compensate the performance gain, and overall overhead is high enough to make the parallel code even slower than the sequential code was. I also found out, there is no real performance gain in such construct:

#pragma omp parallel private(i,j)
{ 
   for (i = 1; i < n; i++)
   {   
      #pragma omp for
      for (j = 0; j < m; j++)
         x[i][j] = x[i-1][j];
   }
}

因为它的性能类似于此

for (i = 1; i < n; i++)
{   
   #pragma omp parallel for private(j)
   for (j = 0; j < m; j++)
      x[i][j] = x[i-1][j];
}

根据这篇文章,感谢GCC libgomp中的内置线程重用: http ://bisqwit.iki.fi/story/howto/openmp/

thanks to built-in thread reusing in GCC libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/

由于外部循环不能被并行化(没有ordered选项),因此看起来没有办法使用OpenMP来显着提高所讨论程序的性能.如果有人觉得我做错了,并且有可能,我将很高兴看到并测试解决方案.

Since the outer loop cannot be paralellized (without ordered option) it looks there is no way to significantly improve performance of the program in question using OpenMP. If someone feels I did something wrong, and it is possible, I'll be glad to see and test the solution.

这篇关于嵌套循环,内部循环并行化,重用线程的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆