OpenMP on a 2-socket system

Problem description

I do some scientific computations in C++ and try to utilize OpenMP for the parallelisation of some of the loops. This has worked well so far, e.g. on an Intel i7-4770 with 8 threads.

We have a small workstation which consists of two Intel CPUs (E5-2680v2) on one mainboard. The code works as long as it runs on 1 CPU with as many threads as I like. But as soon as I employ the second CPU, I observe incorrect results from time to time (around every 50th-100th time I run the code). This happens even when I use only 2 threads and assign them to the two different CPUs. As we have 5 of these workstations (all are identical), I ran the code on each one of them, and all show this problem.
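
(For reference: one way to check on which logical CPUs the OpenMP threads actually end up, e.g. when trying to place them on the two sockets, is a small diagnostic like the sketch below. It is not part of the original code; it assumes Linux with glibc, where sched_getcpu() is available, and the mapping of logical CPU ids to sockets depends on the machine's topology.)

    // placement_check.cc -- minimal sketch, not part of the original program.
    // Prints the logical CPU each OpenMP thread is currently running on, so one
    // can verify that the threads are really spread over both sockets.
    // Build (assumption): g++ placement_check.cc -fopenmp -o placement_check
    #include <sched.h>   // sched_getcpu(), a glibc extension on Linux
    #include <cstdio>
    #include <omp.h>

    int main()
    {
      #pragma omp parallel
      {
        // Which logical CPU is this thread running on right now?
        const int cpu = sched_getcpu();
        #pragma omp critical
        std::printf("Thread %d runs on logical CPU %d\n", omp_get_thread_num(), cpu);
      }
      return 0;
    }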

The workstation runs on OpenSuse 13.1, kernel 3.11.10-7. The problem exists with g++ 4.8.1 and 4.9.0, and with Intel's icc 13.1.3.192 (although the problem doesn't occur as often with icc, it is still there).

The symptoms can be described as follows:

  • I have a large array of std::complex: std::complex<double>* mFourierValues;
  • In the loop, I access and set each element. Each iteration accesses a different element, so I do not have concurrent accesses (I checked this): mFourierValues[idx] = newValue;
  • If I compare the set array-value to the input-value afterwards, roughly mFourierValues[idx] == newValue, this check fails from time to time (though not necessarily every time the results end up being incorrect).

So the symptom looks like I access elements concurrently without any synchronization. However, when I store the indices in a std::vector (with a proper #pragma omp critical), all indices are unique and in the correct range.
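
(For illustration, the uniqueness check described above could look roughly like the following sketch. It is not the actual code; CalcIndex() is a hypothetical stand-in for the real index computation, here simply the loop counter.)

    // index_check.cc -- minimal sketch of the index-uniqueness check described above.
    // Every index written in the parallel loop is recorded under a critical section;
    // afterwards we verify that no index occurs twice and that all are in range.
    // Build (assumption): g++ index_check.cc -fopenmp -o index_check
    #include <vector>
    #include <algorithm>
    #include <cassert>
    #include <cstddef>

    // Hypothetical stand-in for the real index computation (CalcElementIdx() in the
    // full example further below); here it is simply the loop counter.
    static std::size_t CalcIndex(long i) { return static_cast<std::size_t>(i); }

    int main()
    {
      const long numIterations = 1000;
      std::vector<std::size_t> touched;

      #pragma omp parallel for
      for (long i = 0; i < numIterations; ++i)
      {
        const std::size_t idx = CalcIndex(i);

        #pragma omp critical
        touched.push_back(idx);
      }

      std::sort(touched.begin(), touched.end());

      // No index may occur twice ...
      assert(std::adjacent_find(touched.begin(), touched.end()) == touched.end());
      // ... and every index must be in the expected range.
      assert(touched.empty() || touched.back() < static_cast<std::size_t>(numIterations));
      return 0;
    }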

After several days of debugging, my suspicion grows that something else is going on and that my code is correct. To me it looks like something weird happens when the CPUs synchronize their caches with main memory.

So my questions are:

  • Can OpenMP even be used for such a system? (I haven't found a source which says no.)
  • Are there known bugs for such a situation (I haven't found any in the bug trackers)?
  • Where is the problem probably located, in your opinion?
    • My code (which seems to run fine on 1 CPU with multiple cores!),
    • the compilers (both gcc and icc!),
    • the OS,
    • the hardware (a defect on all 5 workstations?)

      OK, I was finally able to produce a shorter (and self-contained) code example.

      • Reserve some memory space. For an array on the stack, this would be accessed like: complex<double> mAllElements[tensorIdx][kappa1][kappa2][kappa3]. I.e. I have 3 rank-3-tensors (tensorIdx). Each tensor represents a 3-dimensional array, indexed by kappa1, kappa2 and kappa3.
      • I have 4 nested loops (over all 4 indices), where the kappa1 loop is the one that gets parallelized (and is the outermost one). They are located in DoComputation().
      • In main(), I call DoComputation() once to get some reference values, and then I call it several times and compare the results. They should match exactly, but sometimes they don't.

      Unfortunately, the code is still around 190 lines long. I tried to simplify it further (only 1 tensor of rank 1, etc.), but then I was never able to reproduce the problem. I guess it appears because the memory accesses are non-aligned (the loop over tensorIdx is the innermost one). (I know, this is far from optimal.)

      Furthermore, some delays were needed in appropriate places to reproduce the bug. That is the reason for the nops() calls. Without them the code runs a lot faster, but so far it hasn't shown the problem.

      Note that I checked the critical part, CalcElementIdx(), again and deem it correct (each element is accessed once). I also ran valgrind's memcheck, helgrind and drd (with a properly recompiled libgomp), and all three gave no errors.
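
      (The exact invocations are not shown here; the generic forms of those three valgrind runs would be something like the following.)

      valgrind --tool=memcheck ./a.out
      valgrind --tool=helgrind ./a.out
      valgrind --tool=drd ./a.out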

      Every second to third start of the program I get one or two mismatches. Example output:

      41      Is exactly 0
      42      Is exactly 0
      43      Is exactly 0
      44      Is exactly 0
      45      348496
      46      Is exactly 0
      47      Is exactly 0
      48      Is exactly 0
      49      Is exactly 0
      

      This is true for gcc and icc.

      My question is: Does the code below look correct to you? (Apart from obvious design flaws.) (If it is too long, I will try to reduce it further, but as described above I failed so far.)

      The code is compiled with:

      g++ main.cc -O3 -Wall -Wextra -fopenmp
      

      icc main.cc -O3 -Wall -Wextra -openmp
      

      Both versions show the described problem when run on 2 CPUs with a total of 40 threads. I couldn't observe the bug on 1 CPU (with as many threads as I like).
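
      (How the 1-CPU runs were restricted to a single socket is not shown here; on Linux, one possible way to force both the threads and the memory of a run onto one socket for such a comparison is numactl.)

      # Assumption: NUMA node 0 corresponds to the first socket on these machines.
      numactl --cpunodebind=0 --membind=0 ./a.out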

      // File: main.cc
      #include <cmath>
      #include <iostream>
      #include <fstream>
      #include <complex>
      #include <cassert>
      #include <iomanip>
      #include <cstdio>     // for printf() used in DoComputation()
      #include <algorithm>  // for std::max() used in main()
      #include <omp.h>
      
      using namespace std;
      
      
      // If defined: We add some nops in certain places, to get the timing right.
      // Without them, I haven't observed the bug.
      #define ENABLE_NOPS
      
      // The size of each of the 3 tensors is: GRID_SIZE x GRID_SIZE x GRID_SIZE
      static const int GRID_SIZE = 60;
      
      //=============================================
      // Produces several nops. Used to get correct "timings".
      
      //----
      template<int N> __attribute__((always_inline)) inline void nop()
      {
          nop<N-1>();
          asm("nop;");
      }
      
      //----
      template<> inline void nop<0>() { }
      
      //----
      __attribute__((always_inline)) inline void nops()
      {
          nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>();
      }
      
      
      
      
      //=============================================
      /*
      Memory layout: We have 3 rank-3-tensors, i.e. 3 arrays of dimension 3.
      The layout looks like this: complex<double> allElements[tensorIdx][kappa1][kappa2][kappa3];
      The kappas represent the indices into a certain tensor, and are all in the interval [0; GRID_SIZE-1].
      */
      class MemoryManagerFFTW
      {
      public:
        //---------- Constructor ----------
        MemoryManagerFFTW()
        {
          mAllElements = new complex<double>[GetTotalNumElements()];
        }
      
        //---------- Destructor ----------
        ~MemoryManagerFFTW() 
        { 
          delete[] mAllElements; 
        }
      
        //---------- SetElement ----------
        void SetElement(int tensorIdx, int kappa1, int kappa2, int kappa3, const complex<double>& newVal)
        {
          // Out-of-bounds error checks are done in this function.
          const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
      
          // These nops here are important to reproduce the bug.
      #if defined(ENABLE_NOPS)
          nops();
          nops();
      #endif
      
          // A flush makes the bug appear more often.
          // #pragma omp flush
          mAllElements[idx] = newVal;
      
          // This was never false, although the same check is false in DoComputation() from time to time.
          assert(newVal == mAllElements[idx]);
        }
      
        //---------- GetElement ----------
        const complex<double>& GetElement(int tensorIdx, int kappa1, int kappa2, int kappa3)const
        {  
          const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
          return mAllElements[idx];
        }
      
      
        //---------- CalcElementIdx ----------
        size_t CalcElementIdx(int tensorIdx, int kappa1, int kappa2, int kappa3)const
        {
          // We have 3 tensors (indexed by "tensorIdx"). Each tensor is of rank 3. In memory, they are placed behind each other.
          // tensorStartIdx is the index of the first element in the tensor.
          const size_t tensorStartIdx = GetNumElementsPerTensor() * tensorIdx;
      
          // Index of the element relative to the beginning of the tensor. A tensor is a 3dim. array of size GRID_SIZE x GRID_SIZE x GRID_SIZE
          const size_t idxInTensor = kappa3 + GRID_SIZE * (kappa2 + GRID_SIZE * kappa1);
      
          const size_t finalIdx = tensorStartIdx + idxInTensor;
          assert(finalIdx < GetTotalNumElements());
      
          return finalIdx;
        }
      
      
        //---------- GetNumElementsPerTensor & GetTotalNumElements ----------
        size_t GetNumElementsPerTensor()const { return GRID_SIZE * GRID_SIZE * GRID_SIZE; }
        size_t GetTotalNumElements()const { return NUM_TENSORS * GetNumElementsPerTensor(); }
      
      
      
      public:
        static const int NUM_TENSORS = 3; // The number of tensors.
        complex<double>* mAllElements; // All tensors. An array [tensorIdx][kappa1][kappa2][kappa3]
      };
      
      
      
      
      //=============================================
      void DoComputation(MemoryManagerFFTW& mSingleLayerManager)
      {
        // Parallelize the outer loop.
        #pragma omp parallel for
        for (int kappa1 = 0; kappa1 < GRID_SIZE; ++kappa1)
        {
          for (int kappa2 = 0; kappa2 < GRID_SIZE; ++kappa2)
          {
            for (int kappa3 = 0; kappa3 < GRID_SIZE; ++kappa3)
            {    
      #ifdef ENABLE_NOPS
              nop<50>();
      #endif
              const double k2 = kappa1*kappa1 + kappa2*kappa2 + kappa3*kappa3;
              for (int j = 0; j < 3; ++j)
              {
                // Compute and set new result.
                const complex<double> curElement = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
                const complex<double> newElement = exp(-k2) * k2 * curElement;
      
                mSingleLayerManager.SetElement(j, kappa1, kappa2, kappa3, newElement);
      
                // Check if the result has been set correctly. This is sometimes false, but _not_ always when the result is incorrect.
                const complex<double> test = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
                if (test != newElement)
                  printf("Failure: (%g, %g) != (%g, %g)\n", test.real(), test.imag(), newElement.real(), newElement.imag());
              }
            }
          }
        }
      }
      
      
      
      //=============================================
      int main()
      {
        cout << "Max num. threads: " << omp_get_max_threads() << endl;
      
        // Call DoComputation() once to get a reference-array.
        MemoryManagerFFTW reference;
        for (size_t i = 0; i < reference.GetTotalNumElements(); ++i)
          reference.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
        DoComputation(reference);
      
        // Call DoComputation() several times, and each time compare the result to the reference.
        const size_t NUM = 1000;
        for (size_t curTry = 0; curTry < NUM; ++curTry)
        {
          MemoryManagerFFTW mSingleLayerManager;
          for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
            mSingleLayerManager.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
          DoComputation(mSingleLayerManager);
      
          // Get the max. difference. This *should* be 0, but isn't from time to time.
          double maxDiff = -1;
          for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
          {
            const complex<double> curDiff = mSingleLayerManager.mAllElements[i] - reference.mAllElements[i];
            maxDiff = max(maxDiff, max(curDiff.real(), curDiff.imag()));
          }
      
          if (maxDiff != 0)
            cout << curTry << "\t" << maxDiff << endl;
          else
            cout << curTry << "\t" << "Is exactly 0" << endl;
        }
      
        return 0;
      }
      

      Edit

      As can be seen from the comments and Zboson's answer below, there was a bug in kernel 3.11.10-7. After an update to 3.15.0-1, my problem is gone, and the code works as it should.

      Answer

      The problem was due to a bug in Linux kernel 3.11.10-7. The bug may be due to how the kernel handles invalidating the TLB cache, as pointed out by Hristo Iliev. I guessed that the kernel might be the problem because I had read that there would be some improvements for NUMA systems in Linux kernel 3.15, so I figured that the kernel version could be important for NUMA systems.

      When the OP updated the Linux kernel of his NUMA system to 3.15.0-1, the problem went away.
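
      (For anyone hitting a similar issue: the running kernel version and the NUMA layout of such a 2-socket machine can be checked with standard tools, for example:)

      uname -r            # running kernel version, e.g. 3.11.10-7 vs. 3.15.0-1
      numactl --hardware  # lists the NUMA nodes (one per socket) with their CPUs and memory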
