No speedup with OpenMP

Problem description

I am using OpenMP to parallelize an algorithm and was hoping for near-linear speedup. Unfortunately, I could not reach the speedup I wanted.

To find out what is wrong with my code, I wrote a second, simpler program, just to double-check that such a speedup is in principle obtainable on my hardware.

Here is the toy example I wrote:

#include <omp.h>       // omp_get_wtime, omp_get_thread_num
#include <algorithm>   // std::fill
#include <iostream>    // std::cout

int main () {
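      int number_of_threads = 1;      // changed by hand for each run
      int n = 1000;
      int m = 50;
      int N = n/number_of_threads;    // outer-loop iterations per thread
      int time_limit = 120;           // seconds
      double total_clock = omp_get_wtime();
      int time_flag = 0;              // set to 1 once the time limit is reached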

      #pragma omp parallel num_threads(number_of_threads)
      {
          int thread_id = omp_get_thread_num();
          int iteration_number_local = 0;
          // Each thread works on its own private buffers.
          double *C = new double[n]; std::fill(C, C+n, 3.0);
          double *D = new double[n]; std::fill(D, D+n, 3.0);
          double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

          int stop = 0;
          while (stop == 0) {
              for (int i = 0; i < N; i++)
                  for (int z = 0; z < m; z++)
                      for (int x = 0; x < n; x++)
                          for (int c = 0; c < n; c++) {
                              CD[c] = C[z]*D[x];
                              C[z] = CD[c] + D[x];
                          }
              iteration_number_local++;
              // time_flag is shared, so access it atomically to avoid a data race.
              if ((omp_get_wtime() - total_clock) >= time_limit) {
                  #pragma omp atomic write
                  time_flag = 1;
              }
              #pragma omp atomic read
              stop = time_flag;
          }

          #pragma omp critical
          std::cout << "I am " << thread_id << " and I got " << iteration_number_local << " iterations." << std::endl;

          delete[] C; delete[] D; delete[] CD;
      }
      return 0;
}

I want to stress again that this code is only a toy example for observing the speedup: the first for loop becomes shorter as the number of parallel threads increases (since N decreases).

When I go from 1 to 2 or 4 threads, the number of iterations doubles as expected. This is not the case with 8, 10, or 20 threads: the number of iterations no longer increases linearly with the number of threads.

Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?

Results

Running the code above, I got the following results.

1 thread: 3 iterations.

2 threads: 5 iterations per thread (instead of 6).

20 threads: 40-44 iterations per thread (instead of 60).

Recommended answer

Your measurement methodology is flawed, especially for a small number of iterations.

1 thread: 3 iterations.

Three reported iterations actually mean that two iterations finished in less than 120 s and the third took longer. One iteration therefore takes between 40 and 60 s.

2 threads: 5 iterations per thread (instead of 6).

Four iterations finished in less than 120 s. One iteration therefore takes between 24 and 30 s.

20 threads: 40-44 iterations per thread (instead of 60).

Forty iterations finished in less than 120 s. One iteration therefore takes between 2.9 and 3 s.

As you can see, your results do not actually contradict linear speedup.

It would be much simpler and more accurate to execute and time a single pass over the outer loop; you will likely see almost perfect linear speedup.

Some (non-exhaustive) reasons why you might not see linear speedup are:

  1. Memory-bound performance. Not the case in your toy example with n = 1000. More generally: contention for a shared resource (main memory, caches, I/O).
  2. Synchronization between threads (e.g. critical sections). Not the case in your toy example.
  3. Load imbalance between threads. Not the case in your toy example.
  4. Turbo mode uses lower frequencies when all cores are utilized. This can happen in your toy example.

Judging from your toy example, I would say that your approach to OpenMP can be improved by making better use of the high-level abstractions, e.g. for.

More general advice would be too broad for this format and would require more specific information about the non-toy example.
