使用GCC(G ++)编译c ++ OpenACC并行CPU代码 [英] Compiling c++ OpenACC parallel CPU code using GCC (G++)

查看:223
本文介绍了使用GCC(G ++)编译c ++ OpenACC并行CPU代码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当尝试使用配置有--enable-languages=c,c++,lto --disable-multilib的GCC-9.3.0(g ++)编译OpenACC代码时,以下代码不使用多个内核,而如果使用pgc ++编译器编译相同的代码,则使用多个内核./p>

g ++编译:g++ -lgomp -Ofast -o jsolve -fopenacc jsolvec.cpp

pgc ++编译:pgc++ -o jsolvec.exe jsolvec.cpp -fast -Minfo=opt -ta=multicore

OpenACC Tutorial1/solver中的代码 https://github.com/OpenACCuserGroup/openacc -users-group.git :

 // Jacobi iterative method for solving a system of linear equations
// This is guaranteed to converge if the matrix is diagonally dominant,
// so we artificially force the matrix to be diagonally dominant.
//  See https://en.wikipedia.org/wiki/Jacobi_method
//
//  We solve for vector x in Ax = b
//    Rewrite the matrix A as a
//       lower triangular (L),
//       upper triangular (U),
//    and diagonal matrix (D).
//
//  Ax = (L + D + U)x = b
//
//  rearrange to get: Dx = b - (L+U)x  -->   x = (b-(L+U)x)/D
//
//  we can do this iteratively: x_new = (b-(L+U)x_old)/D

// build with TYPE=double (default) or TYPE=float
// build with TOLERANCE=0.001 (default) or TOLERANCE= any other value
// three arguments:
//   vector size
//   maximum iteration count
//   frequency of printing the residual (every n-th iteration)

#include <cmath>
#include <omp.h>
#include <cstdlib>
#include <iostream>
#include <iomanip>

using std::cout;

#ifndef TYPE
#define TYPE double
#endif

#define TOLERANCE 0.001

void
init_simple_diag_dom(int nsize, TYPE* A)
{
  int i, j;

  // In a diagonally-dominant matrix, the diagonal element
  // is greater than the sum of the other elements in the row.
  // Scale the matrix so the sum of the row elements is close to one.
  for (i = 0; i < nsize; ++i) {
    TYPE sum;
    sum = (TYPE)0;
    for (j = 0; j < nsize; ++j) {
      TYPE x;
      x = (rand() % 23) / (TYPE)1000;
      A[i*nsize + j] = x;
      sum += x;
    }
    // Fill diagonal element with the sum
    A[i*nsize + i] += sum;

    // scale the row so the final matrix is almost an identity matrix
    for (j = 0; j < nsize; j++)
      A[i*nsize + j] /= sum;
  }
} // init_simple_diag_dom

int
main(int argc, char **argv)
{
  int nsize; // A[nsize][nsize]
  int i, j, iters, max_iters, riter;
  double start_time, elapsed_time;
  TYPE residual, err, chksum;
  TYPE *A, *b, *x1, *x2, *xnew, *xold, *xtmp;

  // set matrix dimensions and allocate memory for matrices
  nsize = 0;
  if (argc > 1)
    nsize = atoi(argv[1]);
  if (nsize <= 0)
    nsize = 1000;

  max_iters = 0;
  if (argc > 2)
    max_iters = atoi(argv[2]);
  if (max_iters <= 0)
    max_iters = 5000;

  riter = 0;
  if (argc > 3)
    riter = atoi(argv[3]);
  if (riter <= 0)
    riter = 200;

  cout << "nsize = " << nsize << ", max_iters = " << max_iters << "\n";

  A = new TYPE[nsize*nsize];

  b = new TYPE[nsize];
  x1 = new TYPE[nsize];
  x2 = new TYPE[nsize];

  // generate a diagonally dominant matrix
  init_simple_diag_dom(nsize, A);

  // zero the x vectors, random values to the b vector
  for (i = 0; i < nsize; i++) {
    x1[i] = (TYPE)0.0;
    x2[i] = (TYPE)0.0;
    b[i] = (TYPE)(rand() % 51) / 100.0;
  }

  start_time = omp_get_wtime();

  //
  // jacobi iterative solver
  //

  residual = TOLERANCE + 1.0;
  iters = 0;
  xnew = x1;    // swap these pointers in each iteration
  xold = x2;
  while ((residual > TOLERANCE) && (iters < max_iters)) {
    ++iters;
    // swap input and output vectors
    xtmp = xnew;
    xnew = xold;
    xold = xtmp;

    #pragma acc parallel loop
    for (i = 0; i < nsize; ++i) {
      TYPE rsum = (TYPE)0;
      #pragma acc loop reduction(+:rsum)
      for (j = 0; j < nsize; ++j) {
        if (i != j) rsum += A[i*nsize + j] * xold[j];
      }
      xnew[i] = (b[i] - rsum) / A[i*nsize + i];
    }
    //
    // test convergence, sqrt(sum((xnew-xold)**2))
    //
    residual = 0.0;
    #pragma acc parallel loop reduction(+:residual)
    for (i = 0; i < nsize; i++) {
      TYPE dif;
      dif = xnew[i] - xold[i];
      residual += dif * dif;
    }
    residual = sqrt((double)residual);
    if (iters % riter == 0 ) cout << "Iteration " << iters << ", residual is " << residual << "\n";
  }
  elapsed_time = omp_get_wtime() - start_time;
  cout << "\nConverged after " << iters << " iterations and " << elapsed_time << " seconds, residual is " << residual << "\n";

  //
  // test answer by multiplying my computed value of x by
  // the input A matrix and comparing the result with the
  // input b vector.
  //
  err = (TYPE)0.0;
  chksum = (TYPE)0.0;

  for (i = 0; i < nsize; i++) {
    TYPE tmp;
    xold[i] = (TYPE)0.0;
    for (j = 0; j < nsize; j++)
      xold[i] += A[i*nsize + j] * xnew[j];
    tmp = xold[i] - b[i];
    chksum += xnew[i];
    err += tmp * tmp;
  }
  err = sqrt((double)err);
  cout << "Solution error is " << err << "\n";
  if (err > TOLERANCE)
    cout << "****** Final Solution Out of Tolerance ******\n" << err << " > " << TOLERANCE << "\n";

  delete A;
  delete b;
  delete x1;
  delete x2;
  return 0;
}
 

解决方案

在GCC中尚不支持使用OpenACC将并行循环调度到多核CPU上.当然,使用OpenMP可以解决此问题,并且您可以将代码与混合的OpenACC(用于代码卸载,如代码中已存在的)和OpenMP指令(用于CPU并行化,代码中尚不存在的)混合使用,以便采用相应的机制是否使用-fopenacc-fopenmp进行编译.

就像PGI一样,它当然可以在GCC中得到支持;我们当然可以实现,但尚未安排好,尚未为GCC筹集资金.

When trying to compile OpenACC code with GCC-9.3.0 (g++) configured with --enable-languages=c,c++,lto --disable-multilib the following code does not use multiple cores, whereas if the same code is compiled with the pgc++ compiler it does use multiple cores.

g++ compilation: g++ -lgomp -Ofast -o jsolve -fopenacc jsolvec.cpp

pgc++ compilation: pgc++ -o jsolvec.exe jsolvec.cpp -fast -Minfo=opt -ta=multicore

Code from OpenACC Tutorial1/solver https://github.com/OpenACCuserGroup/openacc-users-group.git:

// Jacobi iterative method for solving a system of linear equations
// This is guaranteed to converge if the matrix is diagonally dominant,
// so we artificially force the matrix to be diagonally dominant.
//  See https://en.wikipedia.org/wiki/Jacobi_method
//
//  We solve for vector x in Ax = b
//    Rewrite the matrix A as a
//       lower triangular (L),
//       upper triangular (U),
//    and diagonal matrix (D).
//
//  Ax = (L + D + U)x = b
//
//  rearrange to get: Dx = b - (L+U)x  -->   x = (b-(L+U)x)/D
//
//  we can do this iteratively: x_new = (b-(L+U)x_old)/D

// build with TYPE=double (default) or TYPE=float
// build with TOLERANCE=0.001 (default) or TOLERANCE= any other value
// three arguments:
//   vector size
//   maximum iteration count
//   frequency of printing the residual (every n-th iteration)

#include <cmath>
#include <omp.h>
#include <cstdlib>
#include <iostream>
#include <iomanip>

using std::cout;

#ifndef TYPE
#define TYPE double
#endif

#define TOLERANCE 0.001

void
init_simple_diag_dom(int nsize, TYPE* A)
{
  int i, j;

  // In a diagonally-dominant matrix, the diagonal element
  // is greater than the sum of the other elements in the row.
  // Scale the matrix so the sum of the row elements is close to one.
  for (i = 0; i < nsize; ++i) {
    TYPE sum;
    sum = (TYPE)0;
    for (j = 0; j < nsize; ++j) {
      TYPE x;
      x = (rand() % 23) / (TYPE)1000;
      A[i*nsize + j] = x;
      sum += x;
    }
    // Fill diagonal element with the sum
    A[i*nsize + i] += sum;

    // scale the row so the final matrix is almost an identity matrix
    for (j = 0; j < nsize; j++)
      A[i*nsize + j] /= sum;
  }
} // init_simple_diag_dom

int
main(int argc, char **argv)
{
  int nsize; // A[nsize][nsize]
  int i, j, iters, max_iters, riter;
  double start_time, elapsed_time;
  TYPE residual, err, chksum;
  TYPE *A, *b, *x1, *x2, *xnew, *xold, *xtmp;

  // set matrix dimensions and allocate memory for matrices
  nsize = 0;
  if (argc > 1)
    nsize = atoi(argv[1]);
  if (nsize <= 0)
    nsize = 1000;

  max_iters = 0;
  if (argc > 2)
    max_iters = atoi(argv[2]);
  if (max_iters <= 0)
    max_iters = 5000;

  riter = 0;
  if (argc > 3)
    riter = atoi(argv[3]);
  if (riter <= 0)
    riter = 200;

  cout << "nsize = " << nsize << ", max_iters = " << max_iters << "\n";

  A = new TYPE[nsize*nsize];

  b = new TYPE[nsize];
  x1 = new TYPE[nsize];
  x2 = new TYPE[nsize];

  // generate a diagonally dominant matrix
  init_simple_diag_dom(nsize, A);

  // zero the x vectors, random values to the b vector
  for (i = 0; i < nsize; i++) {
    x1[i] = (TYPE)0.0;
    x2[i] = (TYPE)0.0;
    b[i] = (TYPE)(rand() % 51) / 100.0;
  }

  start_time = omp_get_wtime();

  //
  // jacobi iterative solver
  //

  residual = TOLERANCE + 1.0;
  iters = 0;
  xnew = x1;    // swap these pointers in each iteration
  xold = x2;
  while ((residual > TOLERANCE) && (iters < max_iters)) {
    ++iters;
    // swap input and output vectors
    xtmp = xnew;
    xnew = xold;
    xold = xtmp;

    #pragma acc parallel loop
    for (i = 0; i < nsize; ++i) {
      TYPE rsum = (TYPE)0;
      #pragma acc loop reduction(+:rsum)
      for (j = 0; j < nsize; ++j) {
        if (i != j) rsum += A[i*nsize + j] * xold[j];
      }
      xnew[i] = (b[i] - rsum) / A[i*nsize + i];
    }
    //
    // test convergence, sqrt(sum((xnew-xold)**2))
    //
    residual = 0.0;
    #pragma acc parallel loop reduction(+:residual)
    for (i = 0; i < nsize; i++) {
      TYPE dif;
      dif = xnew[i] - xold[i];
      residual += dif * dif;
    }
    residual = sqrt((double)residual);
    if (iters % riter == 0 ) cout << "Iteration " << iters << ", residual is " << residual << "\n";
  }
  elapsed_time = omp_get_wtime() - start_time;
  cout << "\nConverged after " << iters << " iterations and " << elapsed_time << " seconds, residual is " << residual << "\n";

  //
  // test answer by multiplying my computed value of x by
  // the input A matrix and comparing the result with the
  // input b vector.
  //
  err = (TYPE)0.0;
  chksum = (TYPE)0.0;

  for (i = 0; i < nsize; i++) {
    TYPE tmp;
    xold[i] = (TYPE)0.0;
    for (j = 0; j < nsize; j++)
      xold[i] += A[i*nsize + j] * xnew[j];
    tmp = xold[i] - b[i];
    chksum += xnew[i];
    err += tmp * tmp;
  }
  err = sqrt((double)err);
  cout << "Solution error is " << err << "\n";
  if (err > TOLERANCE)
    cout << "****** Final Solution Out of Tolerance ******\n" << err << " > " << TOLERANCE << "\n";

  delete A;
  delete b;
  delete x1;
  delete x2;
  return 0;
}

解决方案

It's not yet supported in GCC to use OpenACC to schedule parallel loops onto multicore CPUs. Using OpenMP works for that, of course, and you can have code with mixed OpenACC (for GPU offloading, as already present in your code) and OpenMP directives (for CPU parallelization, not yet present in your code), so that the respective mechanism will be used depending on whether compiling with -fopenacc vs. -fopenmp.

Like PGI are doing, it certainly can be supported in GCC; we'll certainly be able to implement that, but it has not yet been scheduled, has not yet been funded for GCC.

这篇关于使用GCC(G ++)编译c ++ OpenACC并行CPU代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆