OpenMP drastic slowdown for specific thread number

Question

I ran an OpenMP program that performs the Jacobi method, and it was working very well: 2 threads performed slightly over 2x faster than 1 thread, and 4 threads 2x faster than 1 thread. Everything seemed to be working perfectly... until I reached exactly 20, 22, and 24 threads. I kept stripping the program down until I had this simple test case:

#include <stdio.h>
#include <stdlib.h>  /* atoi */
#include <omp.h>

int main(int argc, char *argv[]) {
    int i, n, maxiter, threads, nsquared, execs = 0;
    double begin, end;

    /* program name plus three arguments: n, threads, maxiter */
    if (argc != 4) {
        printf("usage: %s n threads maxiter\n", argv[0]);
        return 1;
    }

    n = atoi(argv[1]);
    threads = atoi(argv[2]);
    maxiter = atoi(argv[3]);
    omp_set_num_threads(threads);
    nsquared = n * n;

    begin = omp_get_wtime();
    while (execs < maxiter) {

#pragma omp parallel for
        for (i = 0; i < nsquared; i++) {
            //do nothing
        }
        execs++;
    }
    end = omp_get_wtime();

    printf("%f seconds\n", end - begin);

    return 0;
}

And here is some output for different thread numbers:

./a.out 500 1 1000
    0.6765799 seconds

./a.out 500 8 1000
    0.0851808 seconds

./a.out 500 20 1000
    19.5467 seconds

./a.out 500 22 1000
    21.2296 seconds

./a.out 500 24 1000
    20.1268 seconds

./a.out 500 26 1000
    0.1363 seconds

I would understand a big slowdown if it continued for every thread count above 20, because I would put that down to thread overhead (though it seems a bit extreme). But even changing n leaves the times for 20, 22, and 24 the same. Changing maxiter to 100 does scale them down to about 1.9 seconds, 2.2 seconds, ..., meaning the thread creation alone is causing the slowdown, not the loop iterations.

Does this have something to do with the OS attempting to create threads it doesn't have? If it matters, omp_get_num_procs() returns 24, and the machine has Intel Xeon processors (so the 24 includes hyper-threading?)

Thanks for your help.

Recommended answer

I suspect the problem is due to another thread already running at 100% on one core. Because of hyper-threading, this effectively consumes two logical threads. You need to find the core that is causing this and try to exclude it. Let's assume it's logical threads 20 and 21 (you said in your question that the problem starts at 20; are you sure about this?). Try something like this:

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 22 23"

I have never used this before, so you might need to read up on it a bit to get it right (see "OpenMP and CPU affinity"). You might need to list the even threads first and then the odd ones (e.g. 0 2 4 ... 22 1 3 5 ...), in which case I'm not sure which ones to exclude. (Edit: the solution was export GOMP_CPU_AFFINITY="0-17 20-24". See the comments.)

As to why 26 threads would not have the problem, I can only guess. OpenMP can choose to migrate threads to different cores. Your system can run 24 logical threads. I have never found a reason to set the number of threads higher than the number of logical threads (in fact, in my matrix multiplication code I set the number of threads to the number of physical cores, since hyper-threading gives a worse result). Maybe when you set the number of threads higher than the number of logical cores, OpenMP decides it's okay to migrate threads whenever it chooses. If it migrates your threads away from the core running at 100%, the problem could go away. You might be able to test this by disabling thread migration with OMP_PROC_BIND.
