Matlab limits TBB but not OpenMP

Question

I'm asking only to understand a problem I have spent 24 hours trying to fix.

My system: Ubuntu 12.04.2 and Matlab R2011a, both 64-bit, on a Nehalem-based Intel Xeon processor.

The problem is simple: with hyper-threading enabled, Matlab lets OpenMP-based programs use all logical CPU cores but does not allow the same for TBB.

When running TBB, I can launch only 4 threads, even when I change maxNumCompThreads to 8, whereas with OpenMP I can use as many threads as I want. Without hyper-threading, both TBB and OpenMP use all 4 cores, of course.

I understand hyper-threading and that the extra cores are virtual, but the limit Matlab imposes does cause a performance penalty (an extra reference).

I tested this with two programs: a simple for loop with

#pragma omp parallel for

and another very simple loop based on TBB sample code:

tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());

I wrapped both of them in a Matlab mexFunction.

Does anyone have an explanation for this? Is there an inherent difference in how the threads are created or structured that lets Matlab throttle TBB but not OpenMP?

Reference code:

OpenMP:

#include "mex.h"

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ){
    const int threadCount = 100000;  // number of loop iterations to distribute
#pragma omp parallel for
    for(int globalId = 0; globalId < threadCount ; globalId++)
    {
        for(long i=0;i<1000000000L;++i) {} // Deliberately run slow
    }
}

TBB:

#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"

struct mytask {
  mytask(size_t n)
    :_n(n)
  {}
  void operator()() {
    for (long i=0;i<1000000000L;++i) {}  // Deliberately run slow
    std::cerr << "[" << _n << "]";
  }
  size_t _n;
};

template <typename T> struct invoker {
  void operator()(T& it) const {it();}
};

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const
mxArray* prhs[]) {

  tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);  // Automatic number of threads

  std::vector<mytask> tasks;
  for (int i=0;i<10000;++i)
    tasks.push_back(mytask(i));

  tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());

}

Recommended answer

Sorry it took so long to answer. Specifying deferred only keeps the task scheduler from creating the thread pool until the first parallel construct starts. By default the number of threads is automatic, which corresponds to the number of cores. The code that sets this is in src/tbb/tbb_misc_ex.cpp and also depends on CPU affinity, among other things; see initialize_hardware_concurrency_info().
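To see whether CPU affinity could be what shrinks the automatic thread count, it can help to compare the number of CPUs that are online with the number the current process is actually allowed to run on. The following is only a minimal Linux-specific sketch, not TBB's own code: the helper names online_cpus and allowed_cpus are hypothetical, and whether Matlab really narrows the affinity mask is an assumption that this check would merely help confirm or rule out.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (Linux/glibc)
#include <unistd.h>  // sysconf, _SC_NPROCESSORS_ONLN

// Hypothetical helper: logical CPUs currently online, as reported by the OS.
long online_cpus() {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

// Hypothetical helper: logical CPUs in the calling process's affinity mask,
// or -1 if the mask cannot be read.
int allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}
```

Calling both helpers from inside the mexFunction and printing the results (for example with mexPrintf) would show whether the process running under Matlab sees fewer CPUs than the machine has. If the two numbers match, affinity is probably not the cause and one of the "other things" mentioned above is more likely.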

I modified your code slightly:

#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/atomic.h"
#include "tbb/spin_mutex.h"
#include <iostream>
#include <vector>

// If LOW_THREAD == 0, run with task_scheduler_init(automatic), which is the number
// of cores available.  If 1, start with 1 thread.

#ifndef NTASKS
#define NTASKS 50
#endif
#ifndef MAXWORK
#define MAXWORK 400000000L
#endif
#ifndef LOW_THREAD
#define LOW_THREAD 0  // 0 == automatic
#endif

tbb::atomic<size_t> cur_par;
tbb::atomic<size_t> max_par;

#if PRINT_OUTPUT
tbb::spin_mutex print_mutex;
#endif

struct mytask {
  mytask(size_t n) :_n(n) {}
  void operator()() {
      size_t my_par = ++cur_par;
      size_t my_old = max_par;
      // Raise max_par to my_par unless another thread has already recorded a higher value.
      while( my_old < my_par) { my_old = max_par.compare_and_swap(my_par, my_old); }

      for (long i=0;i<MAXWORK;++i) {}  // Deliberately run slow
#if PRINT_OUTPUT
      {
          tbb::spin_mutex::scoped_lock s(print_mutex);
          std::cerr << "[" << _n << "]";
      }
#endif
      --cur_par;
  }
  size_t _n;
};

template <typename T> struct invoker {
  void operator()(T& it) const {it();}
};

void mexFunction(/*int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]*/) {

    for( size_t thr = LOW_THREAD; thr <= 128; thr = thr ? thr * 2: 1) {
        cur_par = max_par = 0;
        tbb::task_scheduler_init init(thr == 0 ? tbb::task_scheduler_init::automatic : static_cast<int>(thr));

        std::vector<mytask> tasks;
        for (int i=0;i<NTASKS;++i) tasks.push_back(mytask(i));

        tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
        std::cout << " for thr == ";
        if(thr) std::cout << thr; else std::cout << "automatic";
        std::cout << ", maximum parallelism == " << (size_t)max_par << std::endl;
    }
}

int main() {
    mexFunction();
}

I ran this on a 16-core system here:

for thr == automatic, maximum parallelism == 16
for thr == 1, maximum parallelism == 1
for thr == 2, maximum parallelism == 2
for thr == 4, maximum parallelism == 4
for thr == 8, maximum parallelism == 8
for thr == 16, maximum parallelism == 16
for thr == 32, maximum parallelism == 32
for thr == 64, maximum parallelism == 50
for thr == 128, maximum parallelism == 50

The limit of 50 is the total number of tasks created by the program.
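The measurement idea used above (count the tasks currently running and keep a high-water mark) does not depend on TBB. As a rough illustration, here is a standard-library sketch of the same technique using plain std::thread workers. The function name measure_peak and the 2 ms sleep standing in for work are assumptions made for this example. It does not reproduce TBB's scheduler, only the bookkeeping.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Runs n_tasks short tasks on n_threads plain std::thread workers and returns
// the largest number of tasks observed running at the same time.
std::size_t measure_peak(std::size_t n_threads, std::size_t n_tasks) {
    std::atomic<std::size_t> next_task{0};  // index of the next unclaimed task
    std::atomic<std::size_t> running{0};    // tasks executing right now
    std::atomic<std::size_t> peak{0};       // high-water mark of `running`

    auto worker = [&] {
        // Claim tasks until none are left.
        while (next_task.fetch_add(1) < n_tasks) {
            const std::size_t mine = ++running;
            // Raise `peak` to `mine` unless another thread already recorded more.
            std::size_t seen = peak.load();
            while (seen < mine && !peak.compare_exchange_weak(seen, mine)) {}
            std::this_thread::sleep_for(std::chrono::milliseconds(2));  // stand-in for work
            --running;
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return peak.load();
}
```

The returned peak can never exceed either the number of workers or the number of tasks, which is the same reason the runs with 64 and 128 threads above stop at 50.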

The threads created by TBB are shared by all the parallel constructs the program starts, so if two parallel for_each calls run simultaneously, the maximum number of threads does not change; each for_each just runs more slowly. The TBB library does not control the number of threads used by OpenMP constructs, so an OpenMP parallel for running alongside a TBB parallel_for_each will generally oversubscribe the machine.
