Many-core CPUs: Programming techniques to avoid disappointing scalability

Published: 2020/5/13 2:26:07. Tags: parallel-processing, cpu, multicore, numa

Problem description

We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (it actually gets slower overall), and after about 6 threads scaling becomes significantly sub-linear.

Our application is very thread-friendly: our job breaks down into about 170,000 small tasks that can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of about 4 GB. They make occasional writes to it, but there might be 10,000 reads to each write: we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. Each thread uses a lot of JVM memory in non-shared objects; the threads make very little access to shared JVM objects, and of that, only a small percentage of accesses involve writes.

We're programming in Java, on Linux, with NUMA enabled. We have 128 GB of RAM. We have 2 Opteron CPUs (model 6274) of 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.

We've tried replicating the read-only data to have one copy per thread, in the hope that most lookups would be local to a NUMA node, but we observed no speedup from this.

With 32 threads, 'top' shows the CPUs at 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk I/O. With 24 threads we get 83% CPU usage. I'm not sure how to interpret the 'idle' state: does it mean 'waiting for the memory controller'?

We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% 'allocation and access misses' (95% of cache misses were local to the NUMA node). But adding "-XX:+UseNUMA" as a Java command-line flag gave us a 10% boost.

One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.

What can we do to either (a) speed up the program to approach linear scalability, or (b) diagnose what's happening?

Also: (c) how do I interpret the 'top' result: does 'idle' mean 'blocked on memory controllers'? And (d) is there any difference in the characteristics of Opterons vs. Xeons?

Recommended answer

I also have a 32-core Opteron machine, with 8 NUMA nodes (4x6128 processors, Magny-Cours, not Bulldozer), and I have faced similar issues.

I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock, it sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that, for some reason, the code taking the lock in question is not included in the profile results.
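One way to cross-check the profiler from inside the JVM is to ask the standard `java.lang.management` API how often, and for how long, each thread has been blocked or waiting. The sketch below is only an illustration: the class name `ContentionReport` and its methods are hypothetical, and it assumes the JVM supports contention monitoring. It also only sees Java-level blocking (monitors, `wait`, parked threads); time lost inside native or kernel locks would not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the original post.
final class ContentionReport {
    // Turns on per-thread contention timing where the JVM supports it.
    // Call this before the workload starts; times accumulate from this point.
    static boolean enable() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadContentionMonitoringSupported()) {
            return false;
        }
        mx.setThreadContentionMonitoringEnabled(true);
        return true;
    }

    // Returns one line per live thread with its blocked/waited counts and times.
    // Times are reported as -1 when contention monitoring is not enabled.
    static String snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            sb.append(info.getThreadName())
              .append(" blocked=").append(info.getBlockedCount())
              .append(" blockedMs=").append(info.getBlockedTime())
              .append(" waited=").append(info.getWaitedCount())
              .append(" waitedMs=").append(info.getWaitedTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Calling `ContentionReport.enable()` at startup and printing `ContentionReport.snapshot()` near the end of a run would show whether the worker threads spend a meaningful share of their time blocked, which is what the lock-contention theory predicts.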

In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution to this problem was to rewrite my work scheduler avoiding mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine-grained lock that scales much better at high core counts. In your case, if your work parcels really do take 5-10 s each, it seems unlikely this will be significant for you.
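The scheduler described above was written in C with gcc builtins. As a rough Java analogue (Java being the asker's language), a shared counter advanced with a single atomic increment can hand out task indices without a mutex. The class and method names below are hypothetical, and this is a sketch of the general idea rather than the actual scheduler from the answer.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: hands out task indices with one atomic increment
// instead of a lock-protected queue.
final class AtomicTaskDispenser {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int taskCount;

    AtomicTaskDispenser(int taskCount) {
        this.taskCount = taskCount;
    }

    // Returns the next unclaimed task index, or -1 once every task is claimed.
    // Each worker calls this at most once past the end, so the counter
    // cannot realistically overflow.
    int claim() {
        int i = next.getAndIncrement();
        return i < taskCount ? i : -1;
    }

    // Starts `threads` workers that drain the dispenser and returns the
    // number of tasks that were run.
    static long runAll(int taskCount, int threads) {
        AtomicTaskDispenser dispenser = new AtomicTaskDispenser(taskCount);
        AtomicLong done = new AtomicLong(); // demo bookkeeping only
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                while (dispenser.claim() >= 0) {
                    // Placeholder for the real task body.
                    done.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return done.get();
    }
}
```

For example, `AtomicTaskDispenser.runAll(170_000, 32)` would distribute 170,000 indices across 32 threads, with every index claimed exactly once. With tasks that take seconds each, the cost of handing out work is negligible either way, which matches the caveat above.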

I also had problems with malloc, which suffers horrible lock issues in high core-count situations, but I can't, off the top of my head, remember whether this also led to sys and idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys and idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.
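The debugger method mentioned above amounts to pausing the program at random moments and looking at where the threads are. A crude in-process approximation in Java is to sample all thread stacks a few times and count the top frames of threads that are not runnable. The `StackSampler` class below is a hypothetical sketch under that assumption. The state and the stack are read at slightly different instants, so the counts are approximate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of stack sampling; not the tool referred to in the answer.
final class StackSampler {
    // Takes `samples` snapshots roughly `intervalMs` apart. For every thread
    // other than the caller that is not RUNNABLE, counts one hit for
    // "<state> at <top stack frame>".
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        for (int s = 0; s < samples; s++) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                StackTraceElement[] stack = e.getValue();
                if (t == Thread.currentThread() || stack.length == 0) {
                    continue;
                }
                Thread.State state = t.getState();
                if (state == Thread.State.RUNNABLE) {
                    continue;
                }
                hits.merge(state + " at " + stack[0], 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break; // stop sampling early if interrupted
            }
        }
        return hits;
    }
}
```

If the same library frame keeps turning up as `BLOCKED` or `WAITING` across many samples of the worker threads, that frame is a candidate for a lock the profiler is not reporting.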

My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!
