Parallel code bad scalability


Problem description

Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. And the general formula that I concluded - the more threads you have, the less speed per core you get - is embarrassing me. Here are the diagrams of my CPU load and processing speed:

So, you can see that processor load increases, but speed increases much slower. I want to know why such an effect takes place and how to get at the reason for the unscalable behaviour. I've made sure to use Server GC mode. I've made sure that the code I'm parallelizing is appropriate, since it does nothing more than:

  • loading data from RAM (the server has 96 GB of RAM; the swap file should not be hit)
  • performing no complex calculations
  • storing data in memory
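
For reference, the Server GC mode mentioned above is switched on in the application's configuration file; a minimal app.config fragment looks like this:

```xml
<configuration>
  <runtime>
    <!-- Use the server garbage collector (one GC heap and collector thread per core). -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```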

I've profiled my application carefully and found no bottlenecks - it looks like each operation becomes slower as the thread count grows.

I'm stuck, what's wrong with my scenario?

I use the .NET 4 Task Parallel Library.
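
A minimal sketch (not the poster's code) of how such a scaling curve can be made explicit with the TPL: run the same CPU-bound workload at increasing degrees of parallelism and print the throughput at each step. The names and the stand-in workload here are illustrative.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ScalingProbe
{
    // Stand-in CPU-bound work item; the real workload is the poster's computation.
    static double Work(int seed)
    {
        double acc = seed;
        for (int i = 1; i < 200000; i++)
            acc += Math.Sqrt(i) * 1e-7;
        return acc;
    }

    static void Main()
    {
        const int items = 1 << 14;
        for (int dop = 1; dop <= Environment.ProcessorCount; dop *= 2)
        {
            var sw = Stopwatch.StartNew();
            Parallel.For(0, items,
                new ParallelOptions { MaxDegreeOfParallelism = dop },
                i => Work(i));
            sw.Stop();
            // If the workload scaled linearly, items/s would double along with dop.
            Console.WriteLine("dop={0,2}: {1:F0} items/s",
                dop, items * 1000.0 / sw.ElapsedMilliseconds);
        }
    }
}
```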

Recommended answer

The key to linear scalability - in the sense that going from one core to two doubles the throughput - is to use shared resources as little as possible. This means:

  • don't use hyperthreading (because the two threads share the same core resource)
  • tie every thread to a specific core (otherwise the OS will juggle the threads between cores)
  • don't use more threads than there are cores (the OS will swap in and out)
  • stay inside the core's own caches - nowadays the L1 & L2 caches
  • don't venture into the L3 cache or RAM unless it is absolutely necessary
  • minimize/economize on critical section/synchronization usage
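
The last bullet can be sketched with the TPL itself: instead of taking a lock once per item, give each partition its own local accumulator and synchronize only once per task at the end. This is `Parallel.For`'s `localInit`/`localFinally` overload; the names below are illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class LocalAccumulation
{
    static void Main()
    {
        const int n = 1000000;
        long total = 0;

        // Instead of "lock (gate) total += i;" in the loop body (one lock
        // acquisition per item), the hot loop below touches no shared state:
        Parallel.For(0, n,
            () => 0L,                                    // per-task local state
            (i, state, local) => local + i,              // lock-free hot path
            local => Interlocked.Add(ref total, local)); // one sync per task

        Console.WriteLine(total); // sum 0..n-1, i.e. n*(n-1)/2
    }
}
```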

If you've come this far, you've probably profiled and hand-tuned your code too.

Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
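
A sketch of that "total thread control" on Windows: dedicated threads (no pool) pinned to specific cores. .NET has no managed per-thread affinity API, so this goes through `ProcessThread` and the native thread id via P/Invoke; it is a Windows-only technique, and the helper names are mine, not from the answer.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

class PinnedThreads
{
    [DllImport("kernel32.dll")]
    static extern int GetCurrentThreadId();

    // Start a dedicated (non-pool) thread bound to one core.
    static Thread RunOnCore(int core, Action work)
    {
        var t = new Thread(() =>
        {
            Thread.BeginThreadAffinity();    // stop the CLR migrating us off the OS thread
            int osId = GetCurrentThreadId();
            ProcessThread pt = Process.GetCurrentProcess().Threads
                .Cast<ProcessThread>().First(p => p.Id == osId);
            pt.ProcessorAffinity = new IntPtr(1 << core); // bitmask: exactly one core
            work();
            Thread.EndThreadAffinity();
        });
        t.Start();
        return t;
    }

    static void Main()
    {
        int workers = Math.Min(Environment.ProcessorCount, 4);
        var threads = new Thread[workers];
        for (int c = 0; c < workers; c++)
        {
            int core = c;
            threads[c] = RunOnCore(core,
                () => Console.WriteLine("worker pinned to core " + core));
        }
        foreach (var t in threads) t.Join();
    }
}
```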

Don't worry about the OS scheduler. If your application is CPU-bound, with long computations that mostly do local L1 & L2 memory accesses, it's a better performance bet to tie each thread to its own core. Sure, the OS will come in, but compared to the work being performed by your threads, the OS work is negligible.

Also I should say that my threading experience is mostly from Windows NT-engine machines.

_______EDIT_______

Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching the code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code are a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).
