Vectorisation of for-loop with data dependency


Problem Description

I have a matrix solver based on BiCCG (Conjugate Gradient) which also accounts for periodicity. As it happens, the implementation is computationally intensive and the loop is not auto-vectorised because of the dependency issues. I did some exploring, and it seems the red-black Gauss-Seidel algorithm parallelises more efficiently than the vanilla version (which suffers from a similar dependency problem).
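
(For reference, a minimal sketch of the red-black (checkerboard) ordering mentioned above, assuming a plain 7-point averaging stencil: the array phi, the grid sizes and the update rule are illustrative placeholders, not the solver's actual coefficients. Every neighbour of a point has the opposite parity of i + j + k, so all points of one colour are mutually independent and can be swept in parallel.)

#include <stdio.h>
#include <omp.h>

#define NX 8
#define NY 8
#define NZ 8

static double phi[NX + 2][NY + 2][NZ + 2];   /* one-cell halo */

int main(void)
{
    phi[1][1][1] = 1.0;                      /* toy initial data */

    /* Two half-sweeps: first all "red" points, then all "black" ones.
     * The k loop starts at the first k with (i + j + k) % 2 == colour
     * and steps by 2, so each half-sweep touches only one colour. */
    for (int colour = 0; colour < 2; colour++) {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i <= NX; i++)
            for (int j = 1; j <= NY; j++)
                for (int k = 1 + (i + j + 1 + colour) % 2; k <= NZ; k += 2)
                    phi[i][j][k] = ( phi[i-1][j][k] + phi[i+1][j][k]
                                   + phi[i][j-1][k] + phi[i][j+1][k]
                                   + phi[i][j][k-1] + phi[i][j][k+1] ) / 6.0;
    }

    printf("phi[1][1][1] = %f\n", phi[1][1][1]);
    return 0;
}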

Can this loop/algorithm be changed so that it can be vectorised efficiently?

 // FORWARD
        #pragma omp for schedule(static, nx/NTt)
        for (i=1; i<=nx; i++) for (j=1; j<=ny; j++) for (k=1; k<=nz; k++)
        {
            dummy = res_sparse_s[i][j][k];

            // backward dependence in i, plus the periodic wrap at the x edge
            dummy -= COEFF[i][j][k][7] * RLL[i-1][j][k];
            if (PeriodicBoundaryX && i==nx) dummy -= COEFF[i][j][k][8] * RLL[1][j][k];

            // backward dependence in j, plus the periodic wrap at the y edge
            dummy -= COEFF[i][j][k][2] * RLL[i][j-1][k];
            if (PeriodicBoundaryY && j==ny) dummy -= COEFF[i][j][k][3] * RLL[i][1][k];

            // backward dependence in k, plus the periodic wrap at the z edge
            dummy -= COEFF[i][j][k][4] * RLL[i][j][k-1];
            if (PeriodicBoundaryZ && k==nz) dummy -= COEFF[i][j][k][5] * RLL[i][j][1];

            RLL[i][j][k] = dummy / h_sparse_s[i][j][k];
        }

P.S. The RLL at any iteration (i, j, k) incorporates the updated RLL at i-1, j-1 and k-1 through the variable dummy. Also, right now the loop is decomposed only in the x direction using the directive schedule(static, nx/NTt), where NTt is just a macro for the available number of threads. Can it be broken down in all directions using the directive collapse?

--------- MAJOR EDIT -------------------------- following Ajay's answer, here is a minimal working example

#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<omp.h>

typedef double lr;

#define nx 4
#define ny 4
#define nz 4

void
print3dmatrix(double a[nx+2][ny+2][nz+2])
{
    for(int i=1; i<= nx; i++) {
        for(int j=1; j<= ny; j++) {
            for(int k=1; k<= nz; k++) {
                printf("%f ", a[i][j][k]);
            }
            printf("\n");
        }
        printf("\n");
    }
}

int 
main()
{

    double a[nx+2][ny+2][nz+2];
    double b[nx+2][ny+2][nz+2];

    srand(3461833726);


    // matrix filling 
    // b is just a copy of a
    for(int i=0; i< nx+2; i++) for(int j=0; j< ny+2; j++) for(int k=0; k< nz+2; k++)
    {
        a[i][j][k] = rand() % 5;
        b[i][j][k] = a[i][j][k];
    }

    // loop 1: the original recurrence; every point reads the freshly
    // updated values at i-1, j-1 and k-1 (loop-carried dependences)
    //#pragma omp parallel for num_threads(1)
    for(int i=1; i<= nx; i++) for(int j=1; j<= ny; j++) for(int k=1; k<= nz; k++)
    {
        a[i][j][k] = -1*a[i-1][j][k] - 1*a[i][j-1][k] -1 * a[i][j][k-1] + 4 * a[i][j][k];
    }

    print3dmatrix(a);
    printf("******************************\n");

    // loop 2: the "skewed" loop; m = j + k, so b[i][j][m-j] is just
    // b[i][j][k] under a change of variable
    //#pragma omp parallel for num_threads(1)
    for(int i=1; i<= nx; i++) 
        for(int j=1; j<= ny; j++)
            // #pragma omp simd
            for(int m=j+1; m<= j+nz; m++)
            {
                b[i][j][m-j] = -1*b[i-1][j][m-j] - 1*b[i][j-1][m-j] -1 * b[i][j][m-j-1] + 4 * b[i][j][m-j];
            }

    print3dmatrix(b);
    printf("=========================\n");

    return 0;
}
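
For reference, the example should build with any OpenMP-capable compiler, e.g. gcc -std=c99 -fopenmp mwe.c (the file name mwe.c is assumed here).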

Key observations -

  1. Matrix a is filled with random numbers between 0 and 5; loop 1 is the untransformed original loop, whereas loop 2 is the transformed loop.
  2. The transformed loop has been skewed so as to remove the dependency.
  3. After the operation, matrices a and b are identical if run without the OpenMP parallelisation.
  4. If OpenMP is deployed, the answers change (possibly because of race conditions) [the loop is not parallelisable irrespective of where the pragma is placed].
  5. If #pragma omp simd is used to enforce vectorisation of the innermost loop, it fails.

Solution

This is a classic problem of loop-carried dependences. Every iteration depends on some other iterations (having finished), so the only way it can be scheduled is serially.

But that is just because of how your loop is written.

You mention that R[i][j][k] depends on the calculations of R[i-1][j][k], R[i][j-1][k] and R[i][j][k-1]. I see three dependences here -

  1. [1, 0, 0]
  2. [0, 1, 0]
  3. [0, 0, 1]

I hope this representation is intuitive.

For your present scenario, dependences 1) and 2) are not an issue, because their k component is 0 while the i/j component is 1: an iteration does not depend on previous iterations of k through those two dependences.

The problem is because of 3). Since its k component is 1, every iteration depends on the previous k iteration. If we could somehow bring a number > 0 into the i/j components, we would be done. A loop skew transformation lets us do exactly that.

A 3D example is slightly difficult to understand, so let's look at a 2D example with i and j.

Suppose R[i][j] depends on R[i-1][j] and R[i][j-1]. We have the same problem.

If we represent this in a picture, it looks like this -

. <- . <- .
     |    |
     v    v
. <- . <- .
     |    |
     v    v
.    .    .

In this picture, every point represents an iteration (i, j), and the arrows originating from each point point to the iterations it depends on. It is clear why we cannot parallelise the innermost loop here.

But suppose we did the skewing as -

        .
       /|   
      / |
    .   .
   /|  /|
  / | / |
.   .   . 
   /|    
  / |  
.   .     


. 

If you draw the same arrows as in the picture above (I cannot draw diagonal arrows in ASCII art), you will see that all the arrows point downwards, i.e. each goes at least one iteration down, which means you can parallelise the horizontal loop.

Now say your new loop dimensions are y (outer loop) and x (inner loop). Your original variables i, j are then recovered as

j = x and i = y - x

Your loop body thus becomes -

for ( y = 0; y < j_max + i_max; y++) 
    for ( x = 0; x < j_max; x++)
        R_dash[y][x] = R_dash[y-1][x-1] + R_dash[y-1][x];

where R_dash is the skewed domain with a one-to-one mapping to R (handling of the band's boundaries is omitted here for clarity).

You will see that both R_dash[y-1][x-1] and R_dash[y-1][x] will have been computed in some previous iteration of y, and hence you can completely parallelise the x loop.

The transformation applied here is (i, j) -> (i + j, j), i.e. y = i + j and x = j.
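
To make this concrete, here is a minimal runnable sketch of the skewed 2D recurrence with the inner loop parallelised. The grid sizes, the boundary initialisation and the update rule are illustrative assumptions; the x bounds are clamped so the wavefront stays inside the valid band (the detail the pseudocode above omits):

#include <stdio.h>
#include <omp.h>

#define IMAX 6
#define JMAX 6

int main(void)
{
    double R[IMAX + 1][JMAX + 1];

    /* Toy boundary data: row 0 and column 0 seed the recurrence. */
    for (int i = 0; i <= IMAX; i++)
        for (int j = 0; j <= JMAX; j++)
            R[i][j] = (i == 0 || j == 0) ? 1.0 : 0.0;

    /* Wavefront y = i + j: every point on a wavefront depends only
     * on points of the previous wavefront, so x can run in parallel. */
    for (int y = 2; y <= IMAX + JMAX; y++) {
        int xlo = (y - IMAX > 1) ? y - IMAX : 1;  /* keep i = y - x within 1..IMAX */
        int xhi = (y - 1 < JMAX) ? y - 1 : JMAX;  /* keep j = x within 1..JMAX    */
        #pragma omp parallel for
        for (int x = xlo; x <= xhi; x++)
            R[y - x][x] = R[y - x - 1][x] + R[y - x][x - 1];
    }

    printf("R[%d][%d] = %f\n", IMAX, JMAX, R[IMAX][JMAX]);
    return 0;
}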

You can similarly work it out for 3 dimensions.
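
As a sketch of the 3D case, the same idea can be expressed as a wavefront over the hyperplane w = i + j + k: every point on one hyperplane depends only on points with smaller w, so the whole plane can be updated in parallel (here with collapse across i and j, touching on the collapse question from the post). The array, its size and the update rule below are placeholders, not the actual solver stencil:

#include <stdio.h>
#include <omp.h>

#define NX 4
#define NY 4
#define NZ 4

static double R[NX + 2][NY + 2][NZ + 2];   /* one-cell halo, as in the question */

int main(void)
{
    /* Toy seed data on the low-index halo faces. */
    for (int i = 0; i <= NX + 1; i++)
        for (int j = 0; j <= NY + 1; j++)
            for (int k = 0; k <= NZ + 1; k++)
                R[i][j][k] = (i == 0 || j == 0 || k == 0) ? 1.0 : 0.0;

    /* Hyperplane sweep: all points with i + j + k == w are independent. */
    for (int w = 3; w <= NX + NY + NZ; w++) {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i <= NX; i++)
            for (int j = 1; j <= NY; j++) {
                int k = w - i - j;                 /* stay on the hyperplane */
                if (k >= 1 && k <= NZ)
                    R[i][j][k] = R[i - 1][j][k]
                               + R[i][j - 1][k]
                               + R[i][j][k - 1];
            }
    }

    printf("R[%d][%d][%d] = %f\n", NX, NY, NZ, R[NX][NY][NZ]);
    return 0;
}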

For a further understanding of how affine transformations work and how they can be used to introduce parallelism, you can see these lecture notes.
