Why is the OMP version slower than the serial one?


Problem description

This is a follow-up question to this one.
Now I have the code:

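#include <cstdio>   // printf
#include <iostream>
#include <cmath>
#include <omp.h>


// Fully parenthesized so the macro expands safely inside larger expressions.
#define max(a, b) ((a) > (b) ? (a) : (b))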

const int m = 2001;
const int n = 2000;
const int p = 4;

double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
double maxdiffA[p + 1];
int icol, jrow;

int main() {
    omp_set_num_threads(p);

    double h = 1.0 / (n + 1);

    double start = omp_get_wtime();

    #pragma omp parallel for private(icol) shared(x, y, v, _new)
    for (icol = 0; icol <= n + 1; ++icol) {
        x[icol] = y[icol] = icol * h;

        _new[icol][0] = v[icol][0] = 6 - 2 * x[icol];

        _new[n + 1][icol] = v[n + 1][icol] = 4 - 2 * y[icol];

        _new[icol][n + 1] = v[icol][n + 1] = 3 - x[icol];

        _new[0][icol] = v[0][icol] = 6 - 3 * y[icol];
    }


    const double eps = 0.01;


    #pragma omp parallel private(icol, jrow) shared(_new, v, maxdiffA)
    {
        while (true) { //for [iters=1 to maxiters by 2]
            #pragma omp single
            for (int i = 0; i < p; i++) maxdiffA[i] = 0;
            #pragma omp for
            for (icol = 1; icol <= n; icol++)
                for (jrow = 1; jrow <= n; jrow++)
                    _new[icol][jrow] =
                            (v[icol - 1][jrow] + v[icol + 1][jrow] + v[icol][jrow - 1] + v[icol][jrow + 1]) / 4;
            #pragma omp for
            for (icol = 1; icol <= n; icol++)
                for (jrow = 1; jrow <= n; jrow++)
                    v[icol][jrow] = (_new[icol - 1][jrow] + _new[icol + 1][jrow] + _new[icol][jrow - 1] +
                                     _new[icol][jrow + 1]) / 4;

            #pragma omp for
            for (icol = 1; icol <= n; icol++)
                for (jrow = 1; jrow <= n; jrow++)
                    maxdiffA[omp_get_thread_num()] = max(maxdiffA[omp_get_thread_num()],
                                                         fabs(_new[icol][jrow] - v[icol][jrow]));

            #pragma omp barrier

            double maxdiff = 0.0;
            for (int k = 0; k < p; ++k) {
                maxdiff = max(maxdiff, maxdiffA[k]);
            }


            if (maxdiff < eps)
                break;
            #pragma omp barrier
            //#pragma omp single
            //std::cout << maxdiff << std::endl;
        }
    }
    double end = omp_get_wtime();
    printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);

    return 0;
}

But why does it run 2-3 times slower (32 sec vs 18 sec) than the serial equivalent:

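#include <cstdio>   // printf
#include <iostream>
#include <cmath>
#include <omp.h>

// Fully parenthesized so the macro expands safely inside larger expressions.
#define max(a, b) ((a) > (b) ? (a) : (b))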

const int m = 2001;
const int n = 2000;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];

int main() {
    double h = 1.0 / (n + 1);

    double start = omp_get_wtime();

    for (int i = 0; i <= n + 1; ++i) {
        x[i] = y[i] = i * h;

        _new[i][0]=v[i][0] = 6 - 2 * x[i];

        _new[n + 1][i]=v[n + 1][i] = 4 - 2 * y[i];

        _new[i][n + 1]=v[i][n + 1] = 3 - x[i];

        _new[0][i]=v[0][i] = 6 - 3 * y[i];
    }

    const double eps=0.01;
    while(true){ //for [iters=1 to maxiters by 2]
        double maxdiff=0.0;
        for (int i=1;i<=n;i++)
            for (int j=1;j<=n;j++)
                _new[i][j]=(v[i-1][j]+v[i+1][j]+v[i][j-1]+v[i][j+1])/4;
        for (int i=1;i<=n;i++)
            for (int j=1;j<=n;j++)
                v[i][j]=(_new[i-1][j]+_new[i+1][j]+_new[i][j-1]+_new[i][j+1])/4;

        for (int i=1;i<=n;i++)
            for (int j=1;j<=n;j++)
                maxdiff=max(maxdiff, fabs(_new[i][j]-v[i][j]));

        if(maxdiff<eps) break;
        std::cout << maxdiff<<std::endl;
    }

    double end = omp_get_wtime();
    printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);

    return 0;
}

It is also interesting that it takes the SAME TIME as another version (I can post it here if you like), which looks like this:

while(true){ // 106 iterations here!!!
#pragma omp parallel for
for(...)
#pragma omp parallel for
for(...)
#pragma omp parallel for
for(...)
}

I thought that what makes the OMP code slow is spawning threads inside the while loop 106 times... but no! Then perhaps threads write to the same array cells simultaneously. But where does that happen? I don't see it; could you show me, please?
Maybe it's because there are too many barriers? But the lecturer told me to implement the code like this and "analyse it". Maybe the answer is "the Jacobi algorithm isn't meant to run well in parallel"? Or is it just my lame coding?

Recommended answer

I am writing to point out a few things. This is too long for a comment, so I am posting it as an answer.

Every time a thread is created, the creation itself takes some time. If your program's running time on a single core is short, creating threads can make the multi-core run take longer.

In addition, a barrier makes all of your threads wait for the others, and any one of them may be delayed on the CPU. So even if most threads finish their work very quickly, the slowest one determines the total run time.

Try running your program with larger arrays, so that the single-threaded time is around 2 minutes. Then move on to multi-core.

Then try wrapping your main code in a normal loop that runs it a few times and prints the timing of each run. The first run might be slow because libraries are being loaded, but the following runs should be faster and show the speed-up.

If the suggestions above do not help, then your code needs more rework.

To downvoters: if you don't like a post, please at least be polite and leave a comment. Or better, give your own answer so that it helps the community.
