在C ++中提高矩阵(多维数组)的OpenMP多线程并行计算效率 [英] Improve OpenMP multi-threading parallel computing efficiency for matrix (multi-dimensional array) in c++

查看:274
本文介绍了在C ++中提高矩阵(多维数组)的OpenMP多线程并行计算效率的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚刚开始使用OpenMP在C ++中进行并行计算.该程序的并行性能很差.由于我不了解许多多线程分析工具(与单线程的简单gprof不同),我编写了一个示例程序来测试性能.

I just started to use OpenMP to do parallel computing in C++. The program has a bad parallel performance. Since I don't know many multi-threading profiling tool (unlike simple gprof for single thread), I wrote a sample program to test the performance.

我有一个2D矩阵(N * N),每个元素都有一个3d向量(x,y,z).我只是简单地执行double for循环来设置矩阵中的每个值:

I have a 2D matrix(N * N), with each element a 3d vector(x, y, z). I simply do a double for loop to set each value in the matrix:

for (int i = 0; i < N; ++i) {
  for (int j = 0; j < N; ++j) {
    vectorStack[i][j] = VECTOR3D(1.0*i*i, 1.0*j*j, 1.0*i*j);
  }
}

其中VECTOR3D是一个简单的类,具有x, y, z属性:

where VECTOR3D is a simple class has x, y, z attributes:

class VECTOR3D {
    double x, y, z;                // component along each axis 
}

另一方面,我也可以使用(N * N * 3)3D数组来做到这一点:

On the other hand, I can also use a (N * N * 3) 3D array to do this:

for (int i = 0; i < N; ++i) {
  for (int j = 0; j < N; ++j) {
    arrayHeap[i][j][0] = (1.0*i*i);
    arrayHeap[i][j][1] = (1.0*j*j);
    arrayHeap[i][j][2] = (1.0*i*j);
  }
}

从内存方面来说,还有几种不同的选择,例如将裸指针与手动分配和取消分配配合使用:

From the memory aspect, there are also several different choices, such as use raw pointers with manually allocate and deallocate:

  double ***arrayHeap;
  arrayHeap = new double** [N];
  for(int i = 0; i < N; ++i) {
    arrayHeap[i] = new double* [N];
    for(int j = 0; j < N; ++j) {
      arrayHeap[i][j] = new double[3];
    }
  }

或直接使用std::vector:

  vector< vector<VECTOR3D>> vectorStack(N, vector<VECTOR3D>(N, VECTOR3D(0, 0, 0)));

我还考虑过像 LAMMP(分子模拟包源代码)那样,为阵列手动分配连续内存.

I also considered to manually allocate continuous memory for arrays as did in LAMMP(Molecular Simulation Package source code.

所以我在N=10000中的结果在这里列出:

So my results for N=10000 are listed here:

对于单线程:

OMP_NUM_THREADS = 1 ./a.out

OMP_NUM_THREADS=1 ./a.out

为堆上的数组分配内存...

Allocating memory for array on heap...

========堆结果上的数组=======

======= Array on heap Results =======

在时间(总计)内完成: 0.720385

Finished within time (total): 0.720385 seconds

在时间(真实)内完成: 0.720463

Finished within time (real): 0.720463 seconds

正在为堆上的数组分配内存...

Deallocating memory for array on heap...

为数组连续分配内存...

Allocating memory for array continous...

========数组连续结果=======

======= Array continuous Results =======

在时间(总计)内完成: 0.819733

Finished within time (total): 0.819733 seconds

在时间(真实)内完成: 0.819774

Finished within time (real): 0.819774 seconds

为数组连续分配内存...

Deallocating memory for array continuous...

为堆上的向量分配内存...

Allocating memory for vector on heap...

=======堆上的向量=======

======= Vector on heap Results =======

在时间(总计)内完成: 3.08715

Finished within time (total): 3.08715 seconds

在真实时间内完成: 3.08725

正在为堆上的向量分配内存...

Deallocating memory for vector on heap...

为堆栈上的向量分配内存...

Allocating memory for vector on stack...

========堆栈上的向量结果=======

======= Vector on stack Results =======

在时间(总计)内完成: 1.49888

Finished within time (total): 1.49888 seconds

在时间(真实)内完成: 1.49895

Finished within time (real): 1.49895 seconds

对于多线程(线程= 4):

OMP_NUM_THREADS = 4 ./a.out

OMP_NUM_THREADS=4 ./a.out

为堆上的数组分配内存...

Allocating memory for array on heap...

========堆结果上的数组=======

======= Array on heap Results =======

在时间(总计)内完成: 2.29184

Finished within time (total): 2.29184 seconds

在时间(真实)内完成: 0.577807

Finished within time (real): 0.577807 seconds

正在为堆上的数组分配内存...

Deallocating memory for array on heap...

为数组连续分配内存...

Allocating memory for array continous...

========数组连续结果=======

======= Array continuous Results =======

在时间(总计)内完成: 1.79501

Finished within time (total): 1.79501 seconds

在时间(真实)内完成: 0.454139

Finished within time (real): 0.454139 seconds

为数组连续分配内存...

Deallocating memory for array continuous...

为堆上的向量分配内存...

Allocating memory for vector on heap...

=======堆上的向量=======

======= Vector on heap Results =======

在时间(总计)内完成: 6.80917

Finished within time (total): 6.80917 seconds

在时间(真实)内完成: 1.92541

Finished within time (real): 1.92541 seconds

正在为堆上的向量分配内存...

Deallocating memory for vector on heap...

为堆栈上的向量分配内存...

Allocating memory for vector on stack...

========堆栈上的向量结果=======

======= Vector on stack Results =======

在时间(总计)内完成: 1.64086

Finished within time (total): 1.64086 seconds

在时间(真实)内完成: 0.411

Finished within time (real): 0.411 seconds

总体并行效率不好.出乎意料的是,花哨的连续内存分配对您没有帮助吗?为什么会这样呢?看来std::vector足以应付这种情况了?

The overall parallel efficiency is not good. Unexpected, the fancy continuous memory allocating is not helpful?! Why is this happening? It seems the std::vector is good enough for this case?

有人可以帮我解释一下结果吗?而且我还需要有关如何改进代码的建议.

Could someone explain the results for me? and I also need suggestions on how to improve the code.

非常感谢!

随附了所有源代码. (请直接转到main,开始时有几个功能可以手动管理内存).

Attached all the source code. (please directly go to main, there are several functions to manually manage the memory at the beginning).

#include <iostream>
#include <omp.h>
#include <vector>
#include <stdlib.h>
#include <cinttypes>
#include "vector3d.h"

typedef int64_t bigint;

void *smalloc(bigint nbytes, const char *name)
{
  if (nbytes == 0) return NULL;
  void *ptr = malloc(nbytes);
  return ptr;
}

template <typename TYPE>
TYPE ***create(TYPE ***&array, int n1, int n2, int n3, const char *name)
{
  bigint nbytes = ((bigint) sizeof(TYPE)) * n1*n2*n3;
  TYPE *data = (TYPE *) smalloc(nbytes,name);
  nbytes = ((bigint) sizeof(TYPE *)) * n1*n2;
  TYPE **plane = (TYPE **) smalloc(nbytes,name);
  nbytes = ((bigint) sizeof(TYPE **)) * n1;
  array = (TYPE ***) smalloc(nbytes,name);

  int i,j;
  bigint m;
  bigint n = 0;
  for (i = 0; i < n1; i++) {
    m = ((bigint) i) * n2;
    array[i] = &plane[m];
    for (j = 0; j < n2; j++) {
      plane[m+j] = &data[n];
      n += n3;
    }
  }
  return array;
}

template <typename TYPE>
TYPE ***create3d_offset(TYPE ***&array, int n1lo, int n1hi,
                        int n2, int n3, const char *name)
{
  int n1 = n1hi - n1lo + 1;
  create(array,n1,n2,n3,name);
  array -= n1lo;
  return array;
}

void sfree(void *ptr) {
  if (ptr == NULL) return;
  free(ptr);
}


template <typename TYPE>
void destroy(TYPE ***&array)
{
  if (array == NULL) return;
  sfree(array[0][0]);
  sfree(array[0]);
  sfree(array);
  array = NULL;
}

template <typename TYPE>
void destroy3d_offset(TYPE ***&array, int offset)
{
  if (array == NULL) return;
  sfree(&array[offset][0][0]);
  sfree(&array[offset][0]);
  sfree(&array[offset]);
  array = NULL;
}


////////////////////////////////////////////////////////
////////////////////////////////////////////////////////
////////////////////////////////////////////////////////

int main() {
  using namespace std;

  const int N = 10000;


  ///////////////////////////////////////

  double sum = 0.0;
  clock_t t;
  double startTime, stopTime, secsElapsed;


  printf("\n\nAllocating memory for array on heap...\n");
  double ***arrayHeap;
  arrayHeap = new double** [N];
  for(int i = 0; i < N; ++i) {
    arrayHeap[i] = new double* [N];
    for(int j = 0; j < N; ++j) {
      arrayHeap[i][j] = new double[3];
    }
  }


  printf("======= Array on heap Results =======\n");

  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        arrayHeap[i][j][0] = (1.0*i*i);
        arrayHeap[i][j][1] = (1.0*j*j);
        arrayHeap[i][j][2] = (1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;

  printf("Deallocating memory for array on heap...\n");
  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; ++j) {
      delete [] arrayHeap[i][j];
    }
    delete [] arrayHeap[i];
  }
  delete [] arrayHeap;


  ///////////////////////////////////////


  printf("\n\nAllocating memory for array continous...\n");
  double ***array_continuous;
  create3d_offset(array_continuous,0, N, N, 3, "array");

  printf("======= Array continuous Results =======\n");

  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        array_continuous[i][j][0] = (1.0*i*i);
        array_continuous[i][j][1] = (1.0*j*j);
        array_continuous[i][j][2] = (1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;


  printf("Deallocating memory for array continuous...\n");
  destroy3d_offset(array_continuous, 0);



///////////////////////////////////////k



  printf("\n\nAllocating memory for vector on heap...\n");
  VECTOR3D ***vectorHeap;
  vectorHeap = new VECTOR3D**[N];
  for(int i = 0; i < N; ++i) {
    vectorHeap[i] = new VECTOR3D* [N];
    for(int j = 0; j < N; ++j) {
      vectorHeap[i][j] = new VECTOR3D(0,0,0);
    }
  }

  printf("======= Vector on heap Results =======\n");
  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        vectorHeap[i][j] = new VECTOR3D(1.0*i*i, 1.0*j*j, 1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;

  printf("Deallocating memory for vector on heap...\n");
  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; ++j) {
      delete [] vectorHeap[i][j];
    }
    delete [] vectorHeap[i];
  }
  delete [] vectorHeap;


  /////////////////////////////////////////////////

  printf("\n\nAllocating memory for vector on stack...\n");
  vector< vector<VECTOR3D>> vectorStack(N, vector<VECTOR3D>(N, VECTOR3D(0, 0, 0)));

  printf("======= Vector on stack Results =======\n");
  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        vectorStack[i][j] = VECTOR3D(1.0*i*i, 1.0*j*j, 1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;


  /////////////////////////////////



  return 0;
}

VECTOR3D类:

#ifndef _VECTOR3D_H
#define _VECTOR3D_H

#include <iostream>
#include <cmath>
#include <iomanip>
#include <limits>

class VECTOR3D {
public:
  double x, y, z;                // component along each axis (cartesian)
  VECTOR3D(double xx = 0.0, double yy = 0.0, double zz = 0.0) : x(xx), y(yy), z(zz)    // make a 3d vector
  {
  }
}

推荐答案

一般性误解

您的琐碎循环不受计算限制,而是完全受内存限制:您只能访问每个元素一次.没有重用意味着您不能有效地使用缓存.因此,您不能期望加速等于使用的线程/核心数.实际的速度取决于特定的系统(内存带宽).

General Misconception

Your trivial loop is not compute bound, but entirely memory bound: You access each element only once. No re-use means that you cannot efficiently use caches. Therefore you can not expect a speedup equal to the number of used threads/cores. The actual speedup depends on the specific system (memory bandwidth).

您所有的数据结构,包括花式连续内存,都会对数据访问执行许多间接操作.这不是严格必要的.要充分利用连续内存的优势,您只需将二维数组平面布局:

All of your data structures, including the fancy continuous memory perform many indirections on the data access. This is not strictly necessary. To gain full advantage of the continuous memory, you should simply lay out your 2d array flat:

template<class T>
class Array2d
{
public:
    Array2d(size_t rows, size_t columns) : rows_(rows), columns_(columns), data_(rows_ * columns_) {}
    T& operator()(size_t r, size_t c)
    {
        return data_[r * columns_ + c];
    }

    const T& operator()(size_t r, size_t c) const
    {
        return data_[r * columns_ + c];
    }

private:
    size_t rows_;
    size_t columns_;
    std::vector<T> data_;
};

注意:如果您确实必须保留[i][j]索引,还可以创建一个精美的operator[]返回一个提供另一个operator[]的代理对象.

Note: You could also make a fancy operator[] that returns a proxy object providing another operator[] if you really must retain the [i][j] indexing.

如果受内存带宽的限制并且N足够大,则间接寻址或平面布局之间不会存在明显的性能差异.

If you are limited by memory bandwidth and N is large enough, there will be no noticeable performance difference between indirection or flat layout.

这篇关于在C ++中提高矩阵(多维数组)的OpenMP多线程并行计算效率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆