gcc openmp thread reuse


Problem Description



I am using gcc's implementation of openmp to try to parallelize a program. Basically the assignment is to add omp pragmas to obtain speedup on a program that finds amicable numbers.

The original serial program was given (shown below, except for the three lines I added, marked with comments). We have to parallelize first just the outer loop, then just the inner loop. The outer loop was easy, and I get close to ideal speedup for a given number of processors. For the inner loop, I get much worse performance than the original serial program. Basically, what I am trying to do is a reduction on the sum variable.

Looking at the CPU usage, I am only using ~30% per core. What could be causing this? Is the program continually making new threads every time it hits the omp parallel for clause? Is there just so much more overhead in doing a barrier for the reduction? Or could it be a memory access issue (e.g. cache thrashing)? From what I have read, most implementations of OpenMP reuse threads over time (e.g. pooled), so I am not so sure the first problem is what is wrong.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define numThread 2
int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
    for (i = 2; i < limit; i++) {
        ser[0] = i;
        for (a = 1; a <= als; a++) {
            ser[a] = 1;
            int prev = ser[a-1];
            if ((prev > i) || (a == 1)) {
                end = sqrt(prev);
                int sum = 0;//added this
                #pragma omp parallel for reduction(+:sum) num_threads(numThread)//added this
                for (j = 2; j <= end; j++) {
                    if (prev % j == 0) {
                        sum += j;
                        sum += prev / j;
                    }
                }
                ser[a] = sum + 1;//added this
            }
        }
        if (ser[als] == i) {
            printf("%d", i);
            for (j = 1; j < als; j++) {
                printf(", %d", ser[j]);
            }
            printf("\n");
        }
    }
}

Solution

OpenMP thread teams are instantiated on entering the parallel section. This means, indeed, that thread creation is repeated every time the inner loop starts.

To enable reuse of threads, use a larger parallel section (to control the lifetime of the team) and specifically control the parallelism for the outer/inner loops, like so:

Execution time for test.exe 1 1000000 has gone down from 43s to 22s using this fix (and the number of threads reflects the numThread defined value + 1).

PS: Perhaps stating the obvious, parallelizing the inner loop does not appear to be a sound performance measure. But that is likely the whole point of this exercise, and I won't critique the question for that.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define numThread 2
int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
#pragma omp parallel num_threads(numThread)
    {
#pragma omp single
        for (i = 2; i < limit; i++) {
            ser[0] = i;
            for (a = 1; a <= als; a++) {
                ser[a] = 1;
                int prev = ser[a-1];
                if ((prev > i) || (a == 1)) {
                    end = sqrt(prev);
                    int sum = 0;//added this
#pragma omp parallel for reduction(+:sum) //added this
                    for (j = 2; j <= end; j++) {
                        if (prev % j == 0) {
                            sum += j;
                            sum += prev / j;
                        }
                    }
                    ser[a] = sum + 1;//added this
                }
            }
            if (ser[als] == i) {
                printf("%d", i);
                for (j = 1; j < als; j++) {
                    printf(", %d", ser[j]);
                }
                printf("\n");
            }
        }
    }
}
