gcc openmp thread reuse


Problem description

I am using gcc's implementation of OpenMP to try to parallelize a program. Basically the assignment is to add omp pragmas to obtain speedup on a program that finds amicable numbers.

The original serial program was given (shown below, except for the 3 lines I added, each marked with a trailing comment). We have to parallelize first just the outer loop, then just the inner loop. The outer loop was easy and I get close to ideal speedup for a given number of processors. For the inner loop, I get much worse performance than the original serial program. Basically what I am trying to do is a reduction on the sum variable.

Looking at the CPU usage, I am only using ~30% per core. What could be causing this? Is the program continually making new threads every time it hits the omp parallel for clause? Is there just that much more overhead in doing a barrier for the reduction? Or could it be a memory access issue (e.g. cache thrashing)? From what I have read, in most implementations of OpenMP threads get reused over time (e.g. pooled), so I am not so sure the first problem is what is wrong.

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include <omp.h>
#define numThread 2
int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
    for (i = 2; i < limit; i++) {
        ser[0] = i;
        for (a = 1; a <= als; a++) {
            ser[a] = 1;
            int prev = ser[a-1];
            if ((prev > i) || (a == 1)) {
                end = sqrt(prev);
                int sum = 0;//added this
                #pragma omp parallel for reduction(+:sum) num_threads(numThread)//added this
                for (j = 2; j <= end; j++) {
                    if (prev % j == 0) {
                        sum += j;
                        sum += prev / j;
                    }
                }
                ser[a] = sum + 1;//added this
            }
        }
        if (ser[als] == i) {
            printf("%d", i);
            for (j = 1; j < als; j++) {
                printf(", %d", ser[j]);
            }
            printf("\n");
        }
    }
}
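
For comparison, the outer-loop variant that the question describes as easy is not shown in the post. A minimal sketch of that idea, assuming the `test.exe 1 <limit>` case (chains of length 1, i.e. numbers equal to the sum of their proper divisors), might look like the following. The names `isqrt_floor`, `aliquot_sum` and `count_perfect` are hypothetical helpers, not part of the original program, and unlike the original loop the sketch does not count the root of a perfect square twice.

```c
/* Floor of the square root, computed with integers only (no libm needed).
   Hypothetical helper, not part of the original program. */
static int isqrt_floor(int n)
{
    int r = 0;
    while ((long long)(r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/* Sum of the proper divisors of n. Runs serially; one call is one unit
   of work for the outer loop. */
static int aliquot_sum(int n)
{
    if (n < 2)
        return 0;
    int end = isqrt_floor(n);
    int sum = 1;                     /* 1 divides every n >= 2 */
    for (int j = 2; j <= end; j++) {
        if (n % j == 0) {
            sum += j;
            if (j != n / j)          /* do not add a square root twice */
                sum += n / j;
        }
    }
    return sum;
}

/* Count the i in [2, limit) whose proper divisors sum to i itself.
   Only the outer loop is parallel: the team is formed once, and each
   thread works on whole values of i. */
int count_perfect(int limit)
{
    int count = 0;
    /* dynamic scheduling because the cost of an iteration grows with i */
    #pragma omp parallel for schedule(dynamic, 256) reduction(+:count)
    for (int i = 2; i < limit; i++) {
        if (aliquot_sum(i) == i)
            count++;
    }
    return count;
}
```

Built with `gcc -fopenmp`, the loop is shared among the threads; without the flag the pragma is ignored and the same code runs serially. Calling `count_perfect(10000)` counts 6, 28, 496 and 8128.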

Solution

OpenMP thread teams are instantiated on entering the parallel section. This means, indeed, that the thread creation is repeated every time the inner loop starts.

To enable reuse of threads, use a larger parallel section (to control the lifetime of the team) and specifically control the parallelism for the outer/inner loops, as in the code at the end of this answer.

With this fix, the execution time for test.exe 1 1000000 went down from 43s to 22s (and the number of threads reflects the value defined by numThread, plus 1).

PS Perhaps stating the obvious: parallelizing the inner loop does not appear to be a sound performance measure. But that is likely the whole point of this exercise, and I won't critique the question for that.

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include <omp.h>

#define numThread 2
int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
#pragma omp parallel num_threads(numThread)
    {
#pragma omp single
        for (i = 2; i < limit; i++) {
            ser[0] = i;
            for (a = 1; a <= als; a++) {
                ser[a] = 1;
                int prev = ser[a-1];
                if ((prev > i) || (a == 1)) {
                    end = sqrt(prev);
                    int sum = 0;//added this
#pragma omp parallel for reduction(+:sum) //added this
                    for (j = 2; j <= end; j++) {
                        if (prev % j == 0) {
                            sum += j;
                            sum += prev / j;
                        }
                    }
                    ser[a] = sum + 1;//added this
                }
            }
            if (ser[als] == i) {
                printf("%d", i);
                for (j = 1; j < als; j++) {
                    printf(", %d", ser[j]);
                }
                printf("\n");
            }
        }
    }
}
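
In the listing above the inner `parallel for` sits inside `single`, so it is a nested parallel region, and how many threads it gets depends on the runtime's nested-parallelism settings. A different way to keep one team alive, sketched here as an alternative and not as the method of this answer, is to let every thread step through the outer loop and to share only the inner loop with `omp for`. The function name `divisor_sums_one_region`, the helper `isqrt_floor` and the `out` array are assumptions made for illustration. The loop body mirrors the question's inner loop, without the chain logic around it.

```c
/* Floor of the square root, computed with integers only (no libm needed).
   Hypothetical helper, not part of the original program. */
static int isqrt_floor(int n)
{
    int r = 0;
    while ((long long)(r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/* For 2 <= i < limit, store in out[i] the value the question's inner loop
   produces for prev == i (sum of divisor pairs, plus 1).
   A single parallel region spans the whole computation. */
void divisor_sums_one_region(int limit, int *out)
{
    int sum = 0;   /* declared outside the region, so shared by the team */

    #pragma omp parallel
    {
        /* i and end live inside the region: each thread has its own copy
           and all threads walk the outer loop in lockstep. */
        for (int i = 2; i < limit; i++) {
            int end = isqrt_floor(i);

            #pragma omp single
            sum = 0;                  /* implicit barrier after single */

            #pragma omp for reduction(+:sum)
            for (int j = 2; j <= end; j++) {
                if (i % j == 0) {
                    sum += j;
                    sum += i / j;
                }
            }                         /* implicit barrier: sum is complete */

            #pragma omp single
            out[i] = sum + 1;         /* implicit barrier before next reset */
        }
    }
}
```

Each outer iteration still pays for three barriers, which can easily cost more than the handful of divisions in the inner loop, so this sketch illustrates team reuse and is not expected to beat the outer-loop version. Without `-fopenmp` the pragmas are ignored and the function runs serially with the same results.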
