Using openMP to parallelize a loop over a vector of C++ objects


Problem description

I'm trying to increase the performance of some C++ code using openMP, but am not seeing very good scaling. Before delving into the details of my code, I have a very general question that I think could save a lot of time if I can get a definitive answer to it.

The basic structure of the code is a vector of objects (let's say size num_objs = 5000) where each object holds a relatively small vector of doubles (let's say size num_elems = 500). I want to loop through this vector of objects, and for each object, perform a subloop on the member vector to modify each element. I am only attempting to parallelize the outer loop (over the objects) as this is the standard approach with openMP and this loop is much larger than the nested one.

So now for my question. Am I taking a severe performance hit by looping over the array of objects and then looping over each of their smaller member vectors? Should I expect a significant increase in performance if I instead made one large vector of size num_objs * num_elems and then did a parallel loop over "chunks" of that big vector corresponding to the member vectors stored in each object described above? That way, both the outer loop and the inner loop would access data from one big vector rather than having to fetch data from separate objects.

The actual code is much more complicated than the above description suggests, so trying this alternative approach would require a lot of time spent modifying it. Therefore, I just wanted to get a feel for how significant a speedup I could get if I spent the time restructuring the entire code. I don't have a lot of knowledge about computer architecture, memory access, caches, etc., so apologies if this is painfully obvious.

I was thinking there was possibly a simple answer to this; however, I see that's not really the case. Please consider the following (simplified example).

#include <cmath>
#include <ctime>
#include <iostream>
#include <omp.h>
#include <string>
#include <vector>

class Block {
public:
  static double a;
  std::vector<double> x;
  std::vector<double> y;
  Block(int N);
};

double Block::a = 5;

int main(int argc, char const *argv[]) {
  int num_blocks = 80000;
  int num_elems = 1000;
  int num_iter = 100;

  int nthreads = 1;
  bool parallel_on = true;

  omp_set_num_threads(nthreads);

  std::vector<Block> block_vec;

  for (int i = 0; i < num_blocks; i++) {
    block_vec.push_back(Block(num_elems));
  }

  double start;
  double end;
  start = omp_get_wtime();

  int iter = 0;

  while (iter < num_iter) {
#pragma omp parallel for if (parallel_on)
    for (int bl = 0; bl < num_blocks; bl++) {
      for (int i = 0; i < num_elems; i++) {
        block_vec[bl].x[i] = Block::a * block_vec[bl].y[i] + block_vec[bl].x[i];
      }
    }
    iter++;
    std::cout << "ITER: " << iter << std::endl;
  }

  end = omp_get_wtime();
  double time_taken = end - start;
  std::cout << "TIME: " << time_taken << std::endl;

  return 0;
}

Block::Block(int N) {
  x.assign(N, 2.0);
  y.assign(N, 3.0);
}

I compile this program with:

g++ -fopenmp -O3 saxpy.cpp

I'm running it on an i7-6700 CPU @ 3.40GHz (four physical cores and eight logical cores). Here is the computational time for differing thread counts:

1 THREAD: 8.65s
2 THREADS: 7.37s
3 THREADS: 7.41s
4 THREADS: 7.65s

I did try a version of this code, as described above, that makes use of one big vector rather than the nested structure; however, the result was about the same, actually a little slower.

Answer

The speed of your program mainly depends on the speed of memory reads/writes (including cache utilization, etc.). Depending on the hardware, you may or may not observe a speed increase. For more details, please read e.g. this.

On my laptop (i7-8550U, g++ -fopenmp -O3 -mavx2 saxpy.cpp) I got similar result, but on a Xeon server I got significant speed improvement:

nthreads=1     
TIME: 13.0372
real    0m14.303s
user    0m13.206s
sys     0m1.096s

nthreads=4
TIME: 5.1537
real    0m5.921s
user    0m18.473s
sys     0m0.615s

nthreads=8
TIME: 3.43479
real    0m4.237s
user    0m27.337s
sys     0m0.608s
