Optimising a 1D heat equation using SIMD


Problem description

I am using a CFD code (for computational fluid dynamics). I recently saw the Intel compiler use SSE in one of my loops, which gave nearly a 2x gain in computation performance in that loop. However, getting SSE and SIMD instructions emitted seems to be more a matter of luck: most of the time the compiler does nothing.

I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future.

I made a simple 1D heat transfer code. It consists of two phases, each using the results of the other (U0 -> U1, then U1 -> U0, then U0 -> U1, etc.). As it iterates, it converges to the stable solution. Most of the loops in my main code use the same kind of calculation (finite differences).

However, my code is two times slower than the plain loop. The results are the same, so the calculations are consistent.

Did I make a mistake? I am testing the loop on a Core 2 before running it on the supercomputer (Westmere).

Here is the code, with the SSE loop first and then the reference loop:

#include <stdio.h>
#include <emmintrin.h>
#include <time.h>
//#include <vector>
#define n1 1004
#define niter 200000

int i,j,t;

double U0[n1] __attribute__ ((aligned(16)));
double U1[n1] __attribute__ ((aligned(16)));
double Dx,Dy,Lx,Ly,InvDxDx,Dt,alpha,totaltime,Stab,DtAlpha,DxDx;

__m128d vmmx00;
__m128d vmmx01;
__m128d vmmx02;
__m128d vmmx10;

__m128d va;
__m128d vb;
__m128d vc;
__m128d vd;

clock_t time0,time1;
FILE *f1;
int main()
{
/* ---- GENERAL ---- */
   alpha = 0.4;
   totaltime = 1.0/100.0;
   Dt = totaltime/((niter-1)*1.0);
   Lx = 1.0;
   Dx = Lx/((n1-1)*1.0);
   InvDxDx = 1.0/(Dx*Dx);
   DxDx = Dx*Dx;
   Stab = alpha*Dt*(InvDxDx);
   DtAlpha = Dt*alpha;
/* Stability if result <= 0.5 */
   printf("Stability factor : %f \n",Stab);


   for( i = 0; i < n1; i++){U0[i] = 0.0;}
   U0[1] = 1.0;
   U0[2] = 1.0;
   U0[3] = 1.0;
   U0[n1-2] = 2.0;

//    for ( i = 0; i < n1; i++) {
 //       for ( j = i + 1; j < n2; j++) {
  //          std::swap(U0[i][j], U0[j][i]);
   //     }
    //}

    va = _mm_set1_pd(-2.0);
    vb = _mm_set1_pd(InvDxDx);
    vd = _mm_set1_pd(DtAlpha);

 time0=clock();

 for( t = 0; t < niter; t++)
 {

    for( i = 2; i < n1-2; i+=2)
    {
      //printf("%d  %d  \n",i,j);
      //fflush(stdout);
      vmmx00 = _mm_load_pd(&U0[i]);
      vmmx01 = _mm_loadu_pd(&U0[i+1]);
      vmmx02 = _mm_loadu_pd(&U0[i-1]);

      vmmx10 = _mm_mul_pd(va,vmmx00);   // U1[i][j] = -2.0*U0[i][j];
      vmmx10 = _mm_add_pd(vmmx10,vmmx01);   // U1[i][j] = U1[i][j] + U0[i+1][j];
      vmmx10 = _mm_add_pd(vmmx10,vmmx02);   // U1[i][j] = U1[i][j] + U0[i-1][j];
      vmmx10 = _mm_mul_pd(vb,vmmx10);   // U1[i][j] = U1[i][j] * InvDxDx;

      vmmx10 = _mm_mul_pd(vd,vmmx10);   // U1[i][j] = U1[i][j] * DtAlpha;
      vmmx10 = _mm_add_pd(vmmx10,vmmx00);   // U1[i][j] = U1[i][j] + U0[i][j];

      _mm_store_pd(&U1[i],vmmx10);

      // U1[i] = U0[i] + DtAlpha*(U0[i+1]-2.0*U0[i]+U0[i-1])*InvDxDx
    }

    for( i = 2; i < n1-2; i+=2)
    {
      //printf("%d  %d  \n",i,j);
      //fflush(stdout);
      vmmx00 = _mm_load_pd(&U1[i]);
      vmmx01 = _mm_loadu_pd(&U1[i+1]);
      vmmx02 = _mm_loadu_pd(&U1[i-1]);

      vmmx10 = _mm_mul_pd(va,vmmx00);   // U0[i][j] = -2.0*U1[i][j];
      vmmx10 = _mm_add_pd(vmmx10,vmmx01);   // U0[i][j] = U0[i][j] + U1[i+1][j];
      vmmx10 = _mm_add_pd(vmmx10,vmmx02);   // U0[i][j] = U0[i][j] + U1[i-1][j];
      vmmx10 = _mm_mul_pd(vb,vmmx10);   // U0[i][j] = U0[i][j] * InvDxDx;

      vmmx10 = _mm_mul_pd(vd,vmmx10);   // U0[i][j] = U0[i][j] * DtAlpha;
      vmmx10 = _mm_add_pd(vmmx10,vmmx00);   // U0[i][j] = U0[i][j] + U1[i][j];

      _mm_store_pd(&U0[i],vmmx10);

      // U0[i] = U1[i] + DtAlpha*(U1[i+1]-2.0*U1[i]+U1[i-1])*InvDxDx
    }

 }

 time1=clock();
 printf("Loop 0, total time : %f \n", (double) time1-time0);
 f1 = fopen ("out0.dat", "wt");
 for( i = 1; i < n1-1; i++)
 {
       fprintf (f1, "%d\t%f\n", i, U0[i]);
 }
 fclose (f1);


// REF

   for( i = 0; i < n1; i++){U0[i] = 0.0;}
   U0[1] = 1.0;
   U0[2] = 1.0;
   U0[3] = 1.0;
   U0[n1-2] = 2.0;

 time0=clock();

 for( t = 0; t < niter; t++)
 {

    for( i = 2; i < n1-2; i++)
    {
       U1[i] = U0[i] + DtAlpha* (U0[i+1]-2.0*U0[i]+U0[i-1])*InvDxDx;
    }

    for( i = 2; i < n1-2; i++)
    {
       U0[i] = U1[i] + DtAlpha* (U1[i+1]-2.0*U1[i]+U1[i-1])*InvDxDx;
    }

 }

 time1=clock();
 printf("Loop 0, total time : %f \n", (double) time1-time0);
 f1 = fopen ("outref.dat", "wt");
 for( i = 1; i < n1-1; i++)
 {
       fprintf (f1, "%d\t%f\n", i, U0[i]);
 }
 fclose (f1);

 return 0;
}
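
For reference, a build line of the sort this can be compiled with (the post does not give the compiler invocation; this assumes GCC and a hypothetical file name):

    gcc -O2 -msse2 -o heat1d heat1d.c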

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

EDIT:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Considering your answers, I found the right place to discuss this, so I will enlarge the topic and explain my aims. If you are willing, we can discuss all of the loops one after another. This can be long, but it could be extremely useful for a lot of people in my domain, and for open-source solvers like OpenFoam. And that is without even considering the impact on energy consumption (we all use large supercomputers).

The CFD code I am using runs for more than one month on 512 Westmere cores. I am using MPI (Message Passing Interface) to communicate between processes. The physical field can be seen as a mesh, so a 1D, 2D or 3D array, depending on the type of simulation; 3D is the case to keep in mind.

The full code is in Fortran 95, which is in effect a simplified C. It is easy to interface with C, and C routines can be called directly from Fortran; the types are the same (int, double, long, etc.). But Fortran does not allow this kind of optimisation: it is designed to be simple. That is why I am investigating C instructions.

In all CFD codes we face the same problems: three types of loops, and MPI memory distribution. Let's first discuss the loops:


  • Spatial derivatives (finite differences): the loop consists of a 1D convolution in all cases (1D, 2D, 3D; you only differentiate along one axis at a time): DF = F[i-1]*A + F[i]*B + F[i+1]*C. However, in more than 1D the memory access becomes the following:

// x1 derivative
for i 1 -> n1
    for j 1 ->n2
        DF_x1[i][j] = F[i-1][j]*A + F[i][j]*B + F[i+1][j]*C

// x2 derivative
for i 1 -> n1
    for j 1 ->n2
        DF_x2[i][j] = F[i][j-1]*D + F[i][j]*E + F[i][j+1]*G

In the first loop, the memory access is not contiguous (it is the opposite in Fortran, where the memory layout is inverted). This is the first problem. The same applies with 3D arrays.
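
For what it's worth, in C (row-major) the same x1 stencil streams three contiguous rows when j is the inner index, so it maps directly onto vector loads. A minimal sketch, assuming flat row-major storage (F[i*n2 + j]); the function and parameter names are mine, and A, B, C are the stencil weights from the pseudocode above:

#include <emmintrin.h>

/* x1 derivative of an n1 x n2 field stored row-major: rows i-1, i and i+1 are
   each read contiguously, two doubles at a time (n2 assumed even here). */
static void dx1(const double *F, double *DF_x1, int n1, int n2,
                double A, double B, double C)
{
    const __m128d va = _mm_set1_pd(A), vb = _mm_set1_pd(B), vc = _mm_set1_pd(C);
    for (int i = 1; i < n1 - 1; i++) {
        const double *Fm = F + (i - 1) * n2;   /* row i-1 */
        const double *F0 = F + i * n2;         /* row i   */
        const double *Fp = F + (i + 1) * n2;   /* row i+1 */
        for (int j = 0; j < n2; j += 2) {
            __m128d r = _mm_add_pd(_mm_mul_pd(va, _mm_loadu_pd(Fm + j)),
                        _mm_add_pd(_mm_mul_pd(vb, _mm_loadu_pd(F0 + j)),
                                   _mm_mul_pd(vc, _mm_loadu_pd(Fp + j))));
            _mm_storeu_pd(DF_x1 + i * n2 + j, r);
        }
    }
}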

  • Poisson equation resolution, i.e. matrix multiplication: the loop consists of a 1D, 2D or 3D convolution, depending on the simulation. It is in fact a second derivative (DDF = D(DF)).

for i 1 -> n1
    for j 1 ->n2
        DDF[i][j] = F[i-1][j]*A + F[i][j]*B + F[i+1][j]*C + F[i][j-1]*D + F[i][j]*E + F[i][j+1]*G

This loop is the same as the one I first gave you, but it is computed directly, not split into even and odd passes.

  • Weighted Gauss-Seidel resolution, i.e. the same loop as below, but with a dependency:

// even
for i 1 -> n1
    for j 1 ->n2
        F1[i][j] = F0[i-1][j]*A + F0[i][j]*B + F0[i+1][j]*C + F0[i][j-1]*D + F0[i][j]*E + F0[i][j+1]*G

//odd
for i 1 -> n1
    for j 1 ->n2
        F0[i][j] = F1[i-1][j]*A + F1[i][j]*B + F1[i+1][j]*C + F1[i][j-1]*D + F1[i][j]*E + F1[i][j+1]*G

This is the loop you investigated earlier.

Then we face another problem: memory distribution. Each core has its own memory and needs to share data with the others. Let's consider the last loop, but simplified:

    for t 1 -> niter

        // even
        for i 1 -> n1-2
                F1[i] = F0[i-1]*A + F0[i]*B + F0[i+1]*C

        //odd
        for i 1 -> n1-2
                F0[i] = F1[i-1]*A + F1[i]*B + F1[i+1]*C

Consider n1 = 512, but suppose this cannot be stored in local memory because of limited RAM. The memory is distributed over core0 (1 -> 255) and core1 (256 -> 512), which are NOT in the same machine but connected through a network. In this case, the derivative at i = 256 needs to know the point i = 255, but that value lives on the other process. The memory holding the other processors' values is called GHOST memory. So the loop is:

    ! update boundary memory :
    Share to ghost : core0 : F0[255] -> Network -> F0[0] : core1 (don't forget that for core1, the array restart from 0)
    Share to ghost : core1 : F0[1] -> Network -> F0[256] : core0 (you understand that F0[256] is the ghost for core0, and F0[0] is the ghost for core1)
    // even, each core do this loop.
    for i 1 -> n1-2
            F1[i] = F0[i-1]*A + F0[i]*B + F0[i+1]*C

    ! update boundary memory :
    Share to ghost : core0 : F1[255] -> Network -> F1[0] : core1
    Share to ghost : core1 : F1[1] -> Network -> F1[256] : core0

    //odd, each core do this loop.
    for i 1 -> n1-2
            F0[i] = F1[i-1]*A + F1[i]*B + F1[i+1]*C

We need to deal with this. Mysticial, you now see where I am going: the loop interleaving needs to take this into account. I think it can be done this way:

        ! update boundary memory :
        Share to ghost : core0 : F0[255] -> Network -> F0[0] : core1 
        Share to ghost : core1 : F0[1] -> Network -> F0[256] : core0

        for t 1 -> niter

        ! compute borders in advance :
        core0 only : F1[255] = F0[254]*A + F0[255]*B + F0[256]*C
        core1 only : F1[1] = F0[0]*A + F0[1]*B + F0[2]*C

        Launch Share to ghost asynchronous : core0 : F1[255] -> Network -> F1[0] : core1 
        Launch Share to ghost asynchronous : core1 : F1[1] -> Network -> F1[256] : core0

        During the same time (this can be done at the same time because MPI support asynchronous communications)
        // even
        for i 2 -> n1-3 (note the reduced domain)
                F1[i] = F0[i-1]*A + F0[i]*B + F0[i+1]*C

        Check that communications are done.


        ! compute borders in advance :
        core0 only : F0[255] = F1[254]*A + F1[255]*B + F1[256]*C
        core1 only : F0[1] = F1[0]*A + F1[1]*B + F1[2]*C

        Launch Share to ghost asynchronous : core0 : F0[255] -> Network -> F0[0] : core1 
        Launch Share to ghost asynchronous : core1 : F0[1] -> Network -> F0[256] : core0

        //odd, each core do this loop.
        for i 2 -> n1-3
                F0[i] = F1[i-1]*A + F1[i]*B + F1[i+1]*C

        Check that communications are done.
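
In MPI terms, one such overlapped half-step might look roughly like the sketch below. The function and variable names are mine, the tag scheme is just one possible choice, and F0's ghost cells are assumed to be up to date from the previous half-step:

#include <mpi.h>

/* One overlapped F0 -> F1 half-step on a 1D domain with one ghost cell per
   side: F[0] and F[n-1] are ghosts, F[1..n-2] are owned by this rank.
   left/right are neighbour ranks (MPI_PROC_NULL at a physical boundary).   */
static void step_overlap(const double *F0, double *F1, int n,
                         double A, double B, double C,
                         int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* compute the border points first, so they can be sent immediately */
    F1[1]     = F0[0]     * A + F0[1]     * B + F0[2]     * C;
    F1[n - 2] = F0[n - 3] * A + F0[n - 2] * B + F0[n - 1] * C;

    /* launch the ghost exchange for F1 asynchronously */
    MPI_Irecv(&F1[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&F1[n - 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&F1[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&F1[n - 2], 1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* overlap: update the interior while the network is busy */
    for (int i = 2; i < n - 2; i++)
        F1[i] = F0[i - 1] * A + F0[i] * B + F0[i + 1] * C;

    /* check that communications are done before the next half-step */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}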

Hope I didn't make a mistake with the indices somewhere. Let's consider the first type of loop for the moment, which is the simplest; we can then go through loops 2 and 3, which are similar. The aim is to do this (which is similar to image processing):

    // x1 derivative
    for i 1 -> n1
        for j 1 ->n2
            DF_x1[i][j] = F[i-1][j]*A + F[i][j]*B + F[i+1][j]*C

    // x2 derivative
    for i 1 -> n1
        for j 1 ->n2
            DF_x2[i][j] = F[i][j-1]*D + F[i][j]*E + F[i][j+1]*G

I am working on it, and I will post the resulting code in a few hours, taking your recommendations into account.

Answer

The wise words of Mysticial, in code:

  xmm7 = InvDxDx * dtAlpha; // precalculate
  xmm6 = -2*xmm7;
  xmm0 = *ptr++;   // has values d0, d1
  xmm1 = *ptr++;   // has values d2, d3
  while (loop--) {
     xmm2  = xmm0;  // take a copy of d0,d1
     xmm0 += xmm1;  // d0+d2, d1+d3
     xmm2 = shufps (xmm2,xmm1, 0x47);  // the middle elements d1,d2 ??
     xmm0 *= xmm7;  // sum of outer elements * factor
     xmm2 *= xmm6;  // -2 * center element * factor
     // here's still a nasty dependency
     xmm2 += xmm0;
     xmm0 = xmm1;    // shift registers
     *out_ptr ++= xmm2;  // flush
     xmm1 = *ptr++;  // read in new values

  }
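
A minimal sketch of the same idea with actual SSE2 intrinsics (the function name and signature are mine, not from the post). It folds the "+ U0[i]" term of the update into the centre coefficient, so ccenter = 1.0 - 2.0*DtAlpha*InvDxDx and couter = DtAlpha*InvDxDx, and every load is aligned:

#include <emmintrin.h>

/* One U0 -> U1 pass over the interior points. U0 and U1 must be 16-byte
   aligned and n even, as in the original code. Each iteration issues one
   aligned load; the unaligned neighbours are built with shuffles instead. */
static void step_sse(const double *U0, double *U1, int n,
                     double couter, double ccenter)
{
    const __m128d vo = _mm_set1_pd(couter);
    const __m128d vc = _mm_set1_pd(ccenter);
    __m128d left = _mm_load_pd(&U0[0]);              /* {U0[0], U0[1]} */
    __m128d mid  = _mm_load_pd(&U0[2]);              /* {U0[2], U0[3]} */
    for (int i = 2; i < n - 2; i += 2) {
        __m128d right = _mm_load_pd(&U0[i + 2]);     /* the only load, aligned */
        __m128d um1 = _mm_shuffle_pd(left, mid, 1);  /* {U0[i-1], U0[i]}   */
        __m128d up1 = _mm_shuffle_pd(mid, right, 1); /* {U0[i+1], U0[i+2]} */
        __m128d res = _mm_add_pd(_mm_mul_pd(vc, mid),
                                 _mm_mul_pd(vo, _mm_add_pd(um1, up1)));
        _mm_store_pd(&U1[i], res);
        left = mid;                                  /* shift the register window */
        mid  = right;
    }
}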

This can be improved by splitting the loop into 2 segments, each working on entries 502 apart (or taken from different tasks), and interleaving the results, which breaks the dependency chain.

Also, the shift-register approach (xmm0 <- xmm1, xmm1 <- new value) can be avoided by unrolling the loop twice and swapping the roles of xmm0 and xmm1 on every other iteration.

This points to another big issue: when using intrinsics, registers are spilled to memory all the time.

