What limits scaling in this simple OpenMP program?


Problem description

I'm trying to understand limits to parallelization on a 48-core system (4x AMD Opteron 6348, 2.8 GHz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel):

// Compile with: gcc scaling.c -std=c99 -fopenmp -O3

#include <stdio.h>
#include <stdint.h>

int main(){

  const uint64_t umin=1;
  const uint64_t umax=10000000000LL;
  double sum=0.;
#pragma omp parallel for reduction(+:sum)
  for(uint64_t u=umin; u<umax; u++)
    sum+=1./u/u;
  printf("%e\n", sum);

}

I was surprised to find that the scaling is highly nonlinear. It takes about 2.9s for the code to run with 48 threads, 3.1s with 36 threads, 3.7s with 24 threads, 4.9s with 12 threads, and 57s for the code to run with 1 thread.

Unfortunately I have to say that there is one process running on the computer using 100% of one core, so that might be affecting it. It's not my process, so I can't end it to test the difference, but somehow I doubt that's making the difference between a 19~20x speedup and the ideal 48x speedup.

To make sure it wasn't an OpenMP issue, I ran two copies of the program at the same time with 24 threads each (one with umin=1, umax=5000000000, and the other with umin=5000000000, umax=10000000000). In that case both copies of the program finish after 2.9s, so it's exactly the same as running 48 threads with a single instance of the program.

What's preventing linear scaling with this simple program?

Recommended answer

I finally got a chance to benchmark the code with a completely unloaded system:

[Plot: speedup vs. number of threads for static and dynamic scheduling; the speedup plateaus at 24 threads.]

For the dynamic schedule I used schedule(dynamic,1000000). For the static schedule I used the default (evenly between the cores). For thread binding I used export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47".

The main reason for the highly nonlinear scaling for this code is that what AMD calls "cores" aren't actually independent cores. This was part (1) of Redrum's answer. This is clearly visible in the plot above from the sudden plateau of speedup at 24 threads; it's really obvious with the dynamic scheduling. It's also obvious from the thread binding that I chose: it turns out what I wrote above would be a terrible choice for binding, because you end up with two threads in each "module".

The second biggest slowdown comes from static scheduling with a large number of threads. Inevitably there is an imbalance between the slowest and fastest threads, introducing large fluctuations in the run time when the iterations are divided into large chunks with the default static scheduling. This part of the answer came both from Hristo's comments and Salt's answer.

I don't know why the effects of "Turbo Boost" aren't more pronounced (part 2 of Redrum's answer). Also, I'm not 100% certain where the last bit of the scaling is lost (presumably in overhead): we get 22x performance instead of the expected 24x from linear scaling in the number of modules. But otherwise the question is pretty well answered.
