Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

Question

If we look at the Visual C++ documentation of omp_set_dynamic, it is literally copy-pasted from the OMP 2.0 standard (section 3.1.7 on page 39):

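"If [the function argument] evaluates to a nonzero value, the number of threads that are used for executing upcoming parallel regions may be adjusted automatically by the run-time environment to best utilize system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function."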

It seems clear that omp_set_dynamic(1) allows the implementation to use fewer than the current maximum number of threads for a parallel region (presumably to prevent oversubscription under high loads). Any reasonable reading of this paragraph would suggest that said reduction should be observable by querying omp_get_num_threads inside parallel regions.

(Both documents also show the signature as void omp_set_dynamic(int dynamic_threads);. It appears that "the number of threads specified by the user" does not refer to dynamic_threads but instead means "whatever the user specified through the rest of the OpenMP interface".)

However, no matter how high I push my system load under omp_set_dynamic(1), the return value of omp_get_num_threads (queried inside the parallel regions) never changes from the maximum in my test program. Yet I can still observe clear performance differences between omp_set_dynamic(1) and omp_set_dynamic(0).

Here is a sample program to reproduce the issue:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <cstdio>
#include <cstdlib>
#include <cmath>

#include <omp.h>

#define UNDER_LOAD true

const int SET_DYNAMIC_TO = 1;

const int REPEATS = 3000;
const unsigned MAXCOUNT = 1000000;

std::size_t threadNumSum = 0;
std::size_t threadNumCount = 0;

void oneRegion(int i)
{
  // Pseudo-randomize the number of iterations.
  unsigned ui = static_cast<unsigned>(i);
  int count = static_cast<int>(((MAXCOUNT + 37) * (ui + 7) * ui) % MAXCOUNT);

#pragma omp parallel for schedule(guided, 512)
  for (int j = 0; j < count; ++j)
  {
    if (j == 0)
    {
      threadNumSum += omp_get_num_threads();
      threadNumCount++;
    }

    if ((j + i + count) % 16 != 0)
      continue;

    // Do some floating point math.
    double a = j + i;
    for (int k = 0; k < 10; ++k)
      a = std::sin(i * (std::cos(a) * j + std::log(std::abs(a + count) + 1)));

    volatile double out = a;
  }
}


int main()
{
  omp_set_dynamic(SET_DYNAMIC_TO);


#if UNDER_LOAD
  for (int i = 0; i < 10; ++i)
  {
    std::thread([]()
    {
      unsigned x = 0;
      float y = static_cast<float>(std::sqrt(2));
      while (true)
      {
//#pragma omp parallel for
        for (int i = 0; i < 100000; ++i)
        {
          x = x * 7 + 13;
          y = 4 * y * (1 - y);
        }
        volatile unsigned xx = x;
        volatile float yy = y;
      }
    }).detach();
  }
#endif


  std::chrono::high_resolution_clock clk;
  auto start = clk.now();

  for (int i = 0; i < REPEATS; ++i)
    oneRegion(i);

  std::cout << (clk.now() - start).count() / 1000ull / 1000ull << " ms for " << REPEATS << " iterations" << std::endl;

  double averageThreadNum = double(threadNumSum) / threadNumCount;
  std::cout << "Entered " << threadNumCount << " parallel regions with " << averageThreadNum << " threads each on average." << std::endl;

  std::getchar();

  return 0;
}

Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27024.1 for x64

With gcc, for example, this program prints a significantly lower averageThreadNum for omp_set_dynamic(1) than for omp_set_dynamic(0). But on MSVC, the same value is shown in both cases, despite a 30% performance difference (170s vs 230s).

How can this be explained?

Recommended answer

In Visual C++, the number of threads executing the loop does get reduced with omp_set_dynamic(1) in this example, which explains the performance difference.

However, contrary to any good-faith interpretation of the standard (and Visual C++ docs), omp_get_num_threads does not report this reduction.

The only way to figure out how many threads MSVC actually uses for each parallel region is to inspect omp_get_thread_num on every loop iteration (or parallel task). The following would be one way to do it with little in-loop performance overhead:

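#include <new>     // std::hardware_destructive_interference_size
#include <vector>

#include <omp.h>

// std::hardware_destructive_interference_size is not available in gcc or clang, also see comments by Peter Cordes:
// https://stackoverflow.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons
struct alignas(2 * std::hardware_destructive_interference_size) NoFalseSharing
{
    int flagValue = 0;
};

// Returns the number of distinct threads that executed at least one iteration.
int foo(int count)
{
  // One padded flag per potential team member.
  std::vector<NoFalseSharing> flags(omp_get_max_threads());

#pragma omp parallel for
  for (int j = 0; j < count; ++j)
  {
    // Mark the executing thread as active; each thread writes only its own slot.
    flags[omp_get_thread_num()].flagValue = 1;

    // Your real loop body
  }

  // Count the threads that actually took part in the loop.
  int realOmpNumThreads = 0;
  for (const auto& flag : flags)
    realOmpNumThreads += flag.flagValue;
  return realOmpNumThreads;
}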

Indeed, you will find realOmpNumThreads to yield significantly different values from the omp_get_num_threads() inside the parallel region with omp_set_dynamic(1) on Visual C++.
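
To make that comparison concrete, the sketch below runs one parallel loop and returns both numbers side by side. The names TeamSizes, PaddedFlag and measureTeamSizes are hypothetical, and the fixed 64-byte alignment is an assumed cache-line size used in place of std::hardware_destructive_interference_size so that the sketch also builds on toolchains that lack it. The stubs under #else are there only so the code still compiles when OpenMP is disabled; none of this is part of the original answer.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Assumption: serial stand-ins so the sketch still builds without OpenMP.
inline int omp_get_max_threads() { return 1; }
inline int omp_get_num_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper; 64 bytes is an assumed cache-line size (needs C++17
// for over-aligned allocation inside std::vector).
struct alignas(64) PaddedFlag
{
  int value = 0;
};

struct TeamSizes
{
  int reported = 0; // value of omp_get_num_threads() inside the region
  int observed = 0; // distinct threads that executed at least one iteration
};

TeamSizes measureTeamSizes(int count)
{
  TeamSizes result;
  std::vector<PaddedFlag> flags(omp_get_max_threads());

#pragma omp parallel for
  for (int j = 0; j < count; ++j)
  {
    // Only the thread that runs iteration 0 writes this field, so there is no race.
    if (j == 0)
      result.reported = omp_get_num_threads();

    // Each thread marks its own padded slot.
    flags[omp_get_thread_num()].value = 1;
  }

  for (const PaddedFlag& flag : flags)
    result.observed += flag.value;

  return result;
}
```

Calling measureTeamSizes once per parallel region and logging both fields should show whether reported and observed diverge on a given compiler and load; observed can never exceed reported, since every thread number belongs to the reported team.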

One could argue that technically

  • "the number of threads in the team executing a parallel region" and
  • "the number of threads that are used for executing upcoming parallel regions"

are not literally the same.

This is a nonsensical interpretation of the standard in my view, because the intent is very clear and there is no reason for the standard to say "The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function" in this section if this number is unrelated to the functionality of omp_set_dynamic.

However, it could be that MSVC decided to keep the number of threads in a team unaffected and just assign no loop iterations for execution to a subset of them under omp_set_dynamic(1) for ease of implementation.
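
One way to probe that hypothesis is to count how many iterations each team member actually executes: if the team keeps its full size but some members receive no work, their counters stay at zero. The following is only a sketch under that assumption. iterationsPerThread and PaddedCounter are hypothetical names, the 64-byte alignment is an assumed cache-line size, and the stubs under #else exist solely so the code compiles without OpenMP.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Assumption: serial stand-ins so the sketch still builds without OpenMP.
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper; padded so neighbouring counters do not share a cache line.
struct alignas(64) PaddedCounter
{
  long long iterations = 0;
};

// Returns, for each possible thread number, how many loop iterations it ran.
std::vector<long long> iterationsPerThread(int count)
{
  std::vector<PaddedCounter> counters(omp_get_max_threads());

#pragma omp parallel for schedule(guided, 512)
  for (int j = 0; j < count; ++j)
  {
    // Each thread increments only its own slot, so no synchronization is needed.
    ++counters[omp_get_thread_num()].iterations;
  }

  std::vector<long long> result;
  result.reserve(counters.size());
  for (const PaddedCounter& counter : counters)
    result.push_back(counter.iterations);
  return result;
}
```

The entries always sum to count. Slots that stay at zero belong to thread numbers that executed nothing, either because they were idle team members or because the team was smaller than omp_get_max_threads(); comparing against omp_get_num_threads() from inside the region helps tell the two apart.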

Whatever the case may be: Do not trust omp_get_num_threads in Visual C++.
