Scaling issues with OpenMP

Problem description

I have written code for a special type of 3D CFD simulation, the Lattice-Boltzmann method (quite similar to the code supplied with the book "The Lattice Boltzmann Method" by Timm Krüger et al.). When multithreading the program with OpenMP I ran into behaviour I can't quite explain: the performance proves to be strongly dependent on the overall domain size.

The basic principle is that each cell of a 3D domain gets assigned certain values for 19 distribution functions (0-18) in discrete directions. They are laid out in two linear arrays allocated on the heap (one population is kept in a separate array): the 18 populations of a given cell are contiguous in memory, the values for consecutive x-values lie next to each other, and so on (so roughly row-major: populations->x->y->z). Those distribution functions redistribute according to certain values within the cell and are then streamed to the neighbouring cells. For this reason I have two populations, f1 and f2. The algorithm takes the values from f1, redistributes them and writes them into f2; then the pointers are swapped and the algorithm starts again. The code works perfectly fine on a single core, but when I try to parallelise it on multiple cores the performance depends on the overall size of the domain: for very small domains (10^3 cells) the algorithm is comparably slow at 15 million cells per second; for moderately small domains (30^3 cells) it is quite fast at over 60 million cells per second; and for anything larger than that the performance drops again to about 30 million cells per second. Running the code on a single core only gives the same performance of about 15 million cells per second. These results of course vary between processors, but qualitatively the same issue remains!
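
For clarity, index functions consistent with this layout would look roughly like the following (a sketch for illustration, not necessarily the exact definitions used in the code):

inline size_t D3Q19_ScalarIndex(unsigned int x, unsigned int y, unsigned int z)
{
    /// one f0 value per cell, cells laid out x -> y -> z
    return ((size_t)z*NY + y)*NX + x;
}

inline size_t D3Q19_FieldIndex(unsigned int x, unsigned int y, unsigned int z, unsigned int d)
{
    /// populations d = 1..18 of a cell are contiguous in memory
    return (((size_t)z*NY + y)*NX + x)*18u + (d - 1);
}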

The core of the code boils down to this parallelised loop, which is executed over and over again with the pointers to f1 and f2 swapped between iterations:

#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static)
    for(unsigned int z = 0; z < NZ; ++z)
    {
        for(unsigned int y = 0; y < NY; ++y)
        {
            for(unsigned int x = 0; x < NX; ++x)
            {
                /// temporary populations
                double ft0  = f0[D3Q19_ScalarIndex(x,y,z)];
                double ft1  = f1[D3Q19_FieldIndex(x,y,z,1)];
                double ft2  = f1[D3Q19_FieldIndex(x,y,z,2)];
                double ft3  = f1[D3Q19_FieldIndex(x,y,z,3)];
                double ft4  = f1[D3Q19_FieldIndex(x,y,z,4)];
                double ft5  = f1[D3Q19_FieldIndex(x,y,z,5)];
                double ft6  = f1[D3Q19_FieldIndex(x,y,z,6)];
                double ft7  = f1[D3Q19_FieldIndex(x,y,z,7)];
                double ft8  = f1[D3Q19_FieldIndex(x,y,z,8)];
                double ft9  = f1[D3Q19_FieldIndex(x,y,z,9)];
                double ft10 = f1[D3Q19_FieldIndex(x,y,z,10)];
                double ft11 = f1[D3Q19_FieldIndex(x,y,z,11)];
                double ft12 = f1[D3Q19_FieldIndex(x,y,z,12)];
                double ft13 = f1[D3Q19_FieldIndex(x,y,z,13)];
                double ft14 = f1[D3Q19_FieldIndex(x,y,z,14)];
                double ft15 = f1[D3Q19_FieldIndex(x,y,z,15)];
                double ft16 = f1[D3Q19_FieldIndex(x,y,z,16)];
                double ft17 = f1[D3Q19_FieldIndex(x,y,z,17)];
                double ft18 = f1[D3Q19_FieldIndex(x,y,z,18)];

                /// microscopic to macroscopic
                double r    = ft0 + ft1 + ft2 + ft3 + ft4 + ft5 + ft6 + ft7 + ft8 + ft9 + ft10 + ft11 + ft12 + ft13 + ft14 + ft15 + ft16 + ft17 + ft18;
                double rinv = 1.0/r;
                double u    = rinv*(ft1 - ft2 + ft7 + ft8  + ft9   + ft10 - ft11 - ft12 - ft13 - ft14);
                double v    = rinv*(ft3 - ft4 + ft7 - ft8  + ft11  - ft12 + ft15 + ft16 - ft17 - ft18);
                double w    = rinv*(ft5 - ft6 + ft9 - ft10 + ft13 -  ft14 + ft15 - ft16 + ft17 - ft18);

                /// collision & streaming
                double trw0 = omega*r*w0;                   //temporary variables
                double trwc = omega*r*wc;
                double trwd = omega*r*wd;
                double uu   = 1.0 - 1.5*(u*u+v*v+w*w);

                double bu = 3.0*u;
                double bv = 3.0*v;
                double bw = 3.0*w;

                unsigned int xp = (x + 1) % NX;             //calculate x,y,z coordinates of neighbouring cells
                unsigned int yp = (y + 1) % NY;
                unsigned int zp = (z + 1) % NZ;
                unsigned int xm = (NX + x - 1) % NX;
                unsigned int ym = (NY + y - 1) % NY;
                unsigned int zm = (NZ + z - 1) % NZ;

                f0[D3Q19_ScalarIndex(x,y,z)]      = bomega*ft0  + trw0*(uu);                        //redistribute distribution functions and stream to neighbouring cells
                double cu = bu;
                f2[D3Q19_FieldIndex(xp,y, z,  1)] = bomega*ft1  + trwc*(uu + cu*(1.0 + 0.5*cu));
                cu = -bu;
                f2[D3Q19_FieldIndex(xm,y, z,  2)] = bomega*ft2  + trwc*(uu + cu*(1.0 + 0.5*cu));
                cu = bv;
                f2[D3Q19_FieldIndex(x, yp,z,  3)] = bomega*ft3  + trwc*(uu + cu*(1.0 + 0.5*cu));
                cu = -bv;
                f2[D3Q19_FieldIndex(x, ym,z,  4)] = bomega*ft4  + trwc*(uu + cu*(1.0 + 0.5*cu));
                cu = bw;
                f2[D3Q19_FieldIndex(x, y, zp, 5)] = bomega*ft5  + trwc*(uu + cu*(1.0 + 0.5*cu));
                cu = -bw;
                f2[D3Q19_FieldIndex(x, y, zm, 6)] = bomega*ft6  + trwc*(uu + cu*(1.0 + 0.5*cu));
                cu = bu+bv;
                f2[D3Q19_FieldIndex(xp,yp,z,  7)] = bomega*ft7  + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = bu-bv;
                f2[D3Q19_FieldIndex(xp,ym,z,  8)] = bomega*ft8  + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = bu+bw;
                f2[D3Q19_FieldIndex(xp,y, zp, 9)] = bomega*ft9  + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = bu-bw;
                f2[D3Q19_FieldIndex(xp,y, zm,10)] = bomega*ft10 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = -bu+bv;
                f2[D3Q19_FieldIndex(xm,yp,z, 11)] = bomega*ft11 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = -bu-bv;
                f2[D3Q19_FieldIndex(xm,ym,z, 12)] = bomega*ft12 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = -bu+bw;
                f2[D3Q19_FieldIndex(xm,y, zp,13)] = bomega*ft13 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = -bu-bw;
                f2[D3Q19_FieldIndex(xm,y, zm,14)] = bomega*ft14 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = bv+bw;
                f2[D3Q19_FieldIndex(x, yp,zp,15)] = bomega*ft15 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = bv-bw;
                f2[D3Q19_FieldIndex(x, yp,zm,16)] = bomega*ft16 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = -bv+bw;
                f2[D3Q19_FieldIndex(x, ym,zp,17)] = bomega*ft17 + trwd*(uu + cu*(1.0 + 0.5*cu));
                cu = -bv-bw;
                f2[D3Q19_FieldIndex(x, ym,zm,18)] = bomega*ft18 + trwd*(uu + cu*(1.0 + 0.5*cu));
            }
        }
    }
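
For context, the surrounding time loop is not shown here; it is essentially just this kernel followed by a pointer swap, along the lines of the following sketch (NT and collide_and_stream are placeholders):

#include <utility>   // std::swap

for(unsigned int step = 0; step < NT; ++step)
{
    /// run the parallel loop above (assumed to be wrapped in a function)
    collide_and_stream(f0, f1, f2);
    /// f2 now holds the new state; swap so the next step reads from it
    std::swap(f1, f2);
}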

It would be awesome if someone could give me tips on how to find the reason for this particular behaviour, or even has an idea of what could cause this problem. If needed, I can supply a full version of the simplified code! Thanks a lot in advance!

Answer

Achieving scaling on shared memory systems (threaded code on a single machine) is quite tricky and often requires large amounts of tuning. What is likely happening in your code is that each thread's part of the domain fits into cache for the "quite small" problem size, but as the problem size increases in NX and NY, the data per thread stops fitting into cache.
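
A quick back-of-the-envelope estimate (assuming double precision and the two lattices described in the question) illustrates the scale: each cell carries roughly 2 x 19 x 8 B ≈ 300 B, so a 30^3 domain occupies about 27,000 x 300 B ≈ 8 MB, which can still fit in the last-level cache of a typical desktop CPU, whereas a 100^3 domain already needs around 300 MB and has to be streamed from main memory. At 10^3 cells (roughly 0.3 MB) the domain is so small that threading overhead dominates instead, which matches the slow small-domain numbers in the question.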

To avoid issues like this, it is better to decompose the domain into blocks of a fixed size, so that it is the number of blocks, not their size, that changes with the domain:

// requires <algorithm> for std::min and <cmath> for std::ceil
const unsigned int numBlocksZ = std::ceil(static_cast<double>(NZ) / BLOCK_SIZE);
const unsigned int numBlocksY = std::ceil(static_cast<double>(NY) / BLOCK_SIZE);
const unsigned int numBlocksX = std::ceil(static_cast<double>(NX) / BLOCK_SIZE);
const unsigned int numBlocks  = numBlocksX*numBlocksY*numBlocksZ;

// the const block counts are predetermined shared in C++, so they need not
// be listed in the shared() clause despite default(none)
#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static,1)
for(unsigned int block = 0; block < numBlocks; ++block)
{
  // decode the linear block index into block coordinates (x fastest, z slowest)
  unsigned int startZ = BLOCK_SIZE*(block / (numBlocksX*numBlocksY));
  unsigned int endZ   = std::min(startZ + BLOCK_SIZE, NZ);
  for(unsigned int z = startZ; z < endZ; ++z) {
    unsigned int startY = BLOCK_SIZE*((block % (numBlocksX*numBlocksY)) / numBlocksX);
    unsigned int endY   = std::min(startY + BLOCK_SIZE, NY);
    for(unsigned int y = startY; y < endY; ++y)
    {
      unsigned int startX = BLOCK_SIZE*(block % numBlocksX);
      unsigned int endX   = std::min(startX + BLOCK_SIZE, NX);
      for(unsigned int x = startX; x < endX; ++x)
      {
        ...
      }
    }
  }
}

An approach like the above should also increase cache locality by using 3D blocking (assuming this is a 3D stencil operation), further improving your performance. You'll need to tune BLOCK_SIZE to find what gives the best performance on a given system (I'd start small and increase in powers of two, e.g. 4, 8, 16...).
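
A minimal sketch of such a sweep (assuming BLOCK_SIZE is turned into a runtime parameter for the test, and the blocked kernel is wrapped in a hypothetical run_timesteps helper):

#include <chrono>
#include <cstdio>

// time a fixed number of iterations of the blocked kernel for several
// power-of-two block sizes and report the throughput of each
for(unsigned int blockSize = 4; blockSize <= 64; blockSize *= 2)
{
    auto t0 = std::chrono::steady_clock::now();
    run_timesteps(blockSize, 100);   // assumed wrapper: 100 steps of the blocked loop
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("block size %2u: %.1f million cells/s\n",
                blockSize, 100.0*NX*NY*NZ/seconds/1.0e6);
}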
