Using OpenACC to parallelize nested loops


Question

I am very new to OpenACC and have only high-level knowledge, so any help and explanation of what I am doing wrong would be appreciated.

I am trying to accelerate (parallelize) a not-so-straightforward nested loop that updates a flattened (3D to 1D) array using OpenACC directives. I have posted a simplified sample code below. When compiled with

pgcc -acc -Minfo=accel test.c

it produces the following error at runtime:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

The code:

#include <stdio.h>
#include <stdlib.h>

#define min(a,b) (((a) < (b)) ? (a) : (b))
#define max(a,b) (((a) > (b)) ? (a) : (b))

#define NX 10
#define NY 10
#define NZ 10

struct phiType {
  double dx, dy, dz;
  double * distance;
};

typedef struct phiType Phi;

#pragma acc routine seq
double solve(Phi *p, int index) {
  // for simplicity just returning a value
  return 2;
}

void fast_sweep(Phi *p) {

  // removing boundaries
  int x = NX - 2; 
  int y = NY - 2;
  int z = NZ - 2;

  int startLevel = 3;
  int endLevel   = x + y + z;

  #pragma acc data copy(p->distance[0:NX*NY*NZ])
  for(int level = startLevel; level <= endLevel; level++){
    int ks = max(1, level-(y + z));
    int ke = min(x, level-2);

    int js = max(1, level-(x + z));
    int je = min(y, level-2);

    #pragma acc region
    {
      #pragma acc loop independent
      for(int k = ks; k <= ke; k++){
        #pragma acc loop independent
        for(int j = js; j <= je; j++){
          int i = level - (k + j);
          if(i > 0 && i <= z){
            int index = i * NX * NY + j * NX + k;
            p->distance[index] = solve(p, index);
          }
        }
      }
    }
  }
}


void create_phi(Phi *p){

  p->dx = 1;
  p->dy = 1;
  p->dz = 1;

  p->distance = (double *) malloc(sizeof(double) * NX * NY * NZ);
  for(int i = 0; i < NZ; i++){
    for(int j = 0; j < NY; j++){
      for(int k = 0; k < NX; k++){
        int index = i * NX * NY + j * NX + k;
        p->distance[index] = (i*j*k == 0) ? 0 : 1;
      }
    }
  }

}


int main()
{
  printf("start \n");

  Phi *p = (Phi *) malloc(sizeof(Phi));
  create_phi(p);

  printf("calling fast sweep \n");
  fast_sweep(p);

  printf(" print the results \n");
  for(int i = 0; i < NZ; i++){
    for(int j = 0; j < NY; j++){
      for(int k = 0; k < NX; k++){
        int index = i * NX * NY + j * NX + k;
        printf("%f ", p->distance[index]);
      }
      printf("\n");
    }
    printf("\n");
  }

  return 0;
}

Using

#pragma acc kernels

instead of the region directive produces the following compiler feedback:

solve:
     19, Generating acc routine seq
fast_sweep:
     34, Generating copy(p->distance[:1000])
     42, Generating copy(p[:1])
     45, Loop carried dependence due to exposed use of p[:1] prevents parallelization
         Accelerator scalar kernel generated
     47, Loop carried dependence due to exposed use of p[:i1+1] prevents parallelization
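
For reference, this is what the compute region looks like with that substitution; it is the only change to the code above:

    #pragma acc kernels
    {
      #pragma acc loop independent
      for(int k = ks; k <= ke; k++){
        #pragma acc loop independent
        for(int j = js; j <= je; j++){
          int i = level - (k + j);
          if(i > 0 && i <= z){
            int index = i * NX * NY + j * NX + k;
            p->distance[index] = solve(p, index);
          }
        }
      }
    }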

I am running this code on:

GNU/Linux
CentOS release 6.7 (Final)
GeForce GTX Titan
pgcc 15.7-0 64-bit target on x86-64 Linux -tp sandybridge 

Answer

The error is coming from the compute kernel on the GPU dereferencing a CPU pointer. This is a pretty common problem and something that the OpenACC committee is working on solving. Dynamic data structures like these can cause a lot of problems, so we want to fix it. Here are two possible workarounds for you.
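
To make the failure mode concrete, here is a minimal sketch of the shallow-copy problem; Box and bad are hypothetical names, not part of your code, and the comments assume the same compiler behavior shown in the -Minfo output above:

typedef struct { double *data; } Box;

void bad(Box *b, int n) {
  #pragma acc data copy(b->data[0:n])
  {
    // The compiler also generates an implicit copy(b[0:1]) -- a bitwise
    // (shallow) copy -- so the device copy of b still holds the HOST
    // address in its data member. Dereferencing b->data[i] inside the
    // kernel is what raises "Illegal address during kernel execution".
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
      b->data[i] = 0.0;
  }
}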

1) Use "managed memory" via the PGI "unified memory evaluation package" option during compiler installation. This is a beta feature, but it puts all of your data into a special type of memory that is visible to both the CPU and GPU. There are a lot of caveats that you should read about in the documentation, most notably that you're limited to the amount of memory available on the GPU and that you cannot access the memory from the CPU while it's being used on the GPU, but it's one possible workaround. Assuming you enabled that option during installation, just add -ta=tesla:managed to your compiler flags to turn it on. I tried this with your code and it worked.
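
So the full compile line, assuming the managed-memory option was enabled during installation, would be:

pgcc -acc -ta=tesla:managed -Minfo=accel test.c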

2) Add a local pointer to your code so that you're not accessing distance through p, but accessing it directly, like so:

double *distance = p->distance;
#pragma acc data copy(p[0:1],distance[0:NX*NY*NZ])
  for(int level = startLevel; level <= endLevel; level++){
    int ks = max(1, level-(y + z));
    int ke = min(x, level-2);

    int js = max(1, level-(x + z));
    int je = min(y, level-2);

    #pragma acc parallel
    {
      #pragma acc loop independent
      for(int k = ks; k <= ke; k++){
        #pragma acc loop independent
        for(int j = js; j <= je; j++){
          int i = level - (k + j);
          if(i > 0 && i <= z){
            int index = i * NX * NY + j * NX + k;
            distance[index] = solve(p, index);
          }
        }
      }
    }
  }

I know this can be a pain when there are a lot of data arrays to do this to, but it's a workaround that I've used successfully in a lot of codes. It's unfortunate that this is necessary, which is why we'd like to provide a better solution in a future version of OpenACC.
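
For example, if Phi held several arrays that the kernels needed (the speed member below is hypothetical, purely for illustration), the same hoisting pattern repeats once per array:

double *distance = p->distance;
double *speed    = p->speed;   // hypothetical second array member
#pragma acc data copy(p[0:1], distance[0:NX*NY*NZ], speed[0:NX*NY*NZ])
{
  // ... loops reference distance and speed directly, never p->distance
}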

I hope this helps! If I can come up with a solution that doesn't require the extra pointer, I'll update this answer.

