How does the communication between CPUs happen?

Problem Description

Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC).

Are there other methods/pathways for this communication to happen?

The reason why it seems that there are other pathways is because Intel nearly halved the amount of L3 cache per core in their newest processor lineup (1.375 MiB per core in SKL-X) vs. previous generations (2.5 MiB per core in Broadwell EP).

Per-core private L2 increased from 256k to 1M, though.

Recommended Answer

There are inter-processor interrupts (IPIs), but those aren't new, and aren't used directly by normal multi-threaded software. The kernel might use an IPI to wake another core from low-power sleep, or maybe to notify it that a high-priority task became runnable after a task on this CPU released an OS-assisted lock / mutex that other tasks were waiting for.

So really no, there are no other pathways.

Reduced size means you have to design your software to reuse data sooner if you want it to still be hot in L3 when a consumer thread gets to it. But note that it's unlikely that the only data in L3 is data that was written by one core and will next be read by another; most multi-threaded workloads involve plenty of private data, too. Also note that SKX L3 is not inclusive, so shared read-only data can stay hot in L2 of the core(s) using it even when it's been evicted from L3.
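
One way to apply this is to hand data from producer to consumer in chunks sized well under the per-core cache budget, so each chunk is likely still hot when the consumer touches it. A minimal C++ sketch of a one-chunk-in-flight handoff; the 128 KiB chunk size, chunk contents, and function names here are illustrative assumptions, not tuned values:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>
#include <cstddef>

// Hypothetical sizing: hand data over in chunks small enough to plausibly
// still be in L3 (or even the producer's private L2) when the consumer
// picks them up, rather than buffering many megabytes at once.
constexpr std::size_t kChunkBytes = 128 * 1024;
constexpr std::size_t kChunkInts  = kChunkBytes / sizeof(int);

std::mutex m;
std::condition_variable cv;
std::vector<int> chunk;               // the single in-flight chunk
bool ready = false, done = false;

void produce(int nChunks) {
    for (int c = 0; c < nChunks; ++c) {
        std::vector<int> next(kChunkInts, c + 1);   // fill one chunk
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !ready; });         // wait until consumed
        chunk = std::move(next);
        ready = true;
        cv.notify_one();
    }
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return !ready; });             // last chunk taken
    done = true;
    cv.notify_one();
}

long long consume() {
    long long total = 0;
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready || done; });
        if (done) return total;
        std::vector<int> local = std::move(chunk);  // take ownership
        ready = false;
        cv.notify_one();
        lk.unlock();
        for (int v : local) total += v;             // data still hot
    }
}
```

Real code would tune `kChunkBytes` against the per-core figures discussed above (1 MiB private L2, ~1.375 MiB of L3 per core on SKX) and likely use more than one chunk in flight.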

It would be really nice for developers if L3 was gigantic and fast, but it isn't. Besides the reduced size of L3, the bandwidth and latency is also significantly worse in SKX than in BDW. See @Mysticial's comments about y-cruncher performance:

The L3 cache mesh on Skylake X only has about half the bandwidth of the L3 cache on the previous generation Haswell/Broadwell-EP processors. The Skylake X L3 cache is so slow that it's barely faster than main memory in terms of bandwidth. So for all practical purposes, it's as good as non-existent.

He's not talking about communication between threads, just the amount of useful cache per core for independent threads. But AFAIK, a producer/consumer model should be pretty similar.

From the software optimization standpoint, the cache bottleneck brings a new set of difficulties. The L2 cache is fine. It is 4x larger than before and has doubled in bandwidth to keep up with the AVX512. But the L3 is useless. The net effect is that the usable cache per core is halved compared to the previous Haswell/Broadwell generations. Furthermore, doubling of the SIMD size with AVX512 makes the usable cache 4x smaller than before in terms of # of SIMD words that fit in cache.

Given all that, it may not make a huge difference whether producer/consumer threads hit in L3 or go to main memory. Fortunately, DRAM is pretty fast with high aggregate bandwidth if many threads are active. Single-thread max bandwidth is still lower than in Broadwell.

SiSoft has memory bandwidth / latency benchmark results here.

For a 10-core (20 thread) SKX (i9-7900X CPU @ nominal 3.30GHz), the highest result is one overclocked to 4.82GHz cores with 3.2GHz memory, achieving an aggregate(?) bandwidth of 105.84GB/s and latency of 54.9ns.

One of the lowest results is with 4GHz/4.5GHz cores, and 2.4GHz IMC: 66.11GB/s bandwidth, 76.6ns latency. (Scroll to the bottom of the page to see other submissions for the same CPU).

By comparison, a desktop Skylake i7-6700k (4C 8T 4.21GHz, 4.1GHz IMC) scores 35.51GB/s and 40.5ns. Some more overclocked results are 42.72GB/s and 36.3ns.

For a single pair of threads, I think SKL-desktop is faster than SKX. I think this benchmark is measuring aggregate bandwidth between 20 threads on the 10C/20T CPU.

This single-threaded benchmark shows only about 20GB/s for SKL-X for block sizes from 2MB to 8MB, pretty much exactly the same as main memory bandwidth. The Kaby Lake quad-core i7-7700k on the graph looks like maybe 60GB/s. It's not plausible that inter-thread bandwidth is higher than single-thread bandwidth for the SKX, unless SiSoft Sandra is counting loads + stores for the inter-thread case. (Single-thread bandwidth tends to suck on Intel many-core CPUs: see the "latency-bound platform" section of this answer. Higher L3 latency means bandwidth is limited by the number of outstanding L1 or L2 misses / prefetch requests.)

Another complication is that when running with hyperthreading enabled, some inter-thread communication may happen through L1D / L2 if the block size is small enough. See What will be used for data exchange between threads are executing on one Core with HT?, and also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.

I don't know how that benchmark pins threads to logical cores, and whether they try to avoid or maximize communication between logical cores of the same physical core.
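
On Linux, that placement can be controlled explicitly. A hedged sketch using glibc's `pthread_setaffinity_np` through `std::thread::native_handle`; the CPU numbers 0 and 1 here are placeholders, and whether two logical CPUs are hyperthread siblings of one physical core is machine-specific (check `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`):

```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <thread>

// Pin a std::thread to one logical CPU via its native pthread handle.
// Returns true on success (glibc/Linux-specific API).
bool pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
}

// Start two threads and pin them to explicit logical CPUs before letting
// them run. Whether CPUs 0 and 1 share a physical core depends on the
// machine's topology, so this only demonstrates the mechanism.
bool pin_pair() {
    unsigned n = std::thread::hardware_concurrency();
    std::atomic<bool> go{false};
    auto hold = [&go] {                 // keep threads alive until pinned
        while (!go.load(std::memory_order_acquire))
            std::this_thread::yield();
    };
    std::thread a(hold), b(hold);
    bool ok = pin_to_cpu(a, 0) && pin_to_cpu(b, n > 1 ? 1 : 0);
    go.store(true, std::memory_order_release);
    a.join();
    b.join();
    return ok;
}
```

A benchmark that wanted to maximize (or avoid) sibling communication would pick the CPU numbers from the topology files rather than hard-coding them as this sketch does.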

When designing a multi-threaded application, aim for memory locality within each thread. Try to avoid passing huge blocks of memory between threads, because that's less efficient even in previous CPUs. SKL-AVX512 aka SKL-SP aka SKL-X aka SKX just makes it worse than before.

Synchronize between threads with flag variables or progress counters.
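
A progress counter can be as simple as one `std::atomic` index with release/acquire ordering: the producer publishes how many elements are valid, and the consumer may read everything below that index without a lock. A minimal sketch, with array size and values chosen arbitrarily for illustration:

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <cstddef>

std::vector<int> data(1000);
// Progress counter: number of elements the producer has published.
std::atomic<std::size_t> progress{0};

void producer() {
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = static_cast<int>(i * 2);
        // Release store: everything written before this (data[i]) is
        // visible to an acquire load that observes the new count.
        progress.store(i + 1, std::memory_order_release);
    }
}

long long consumer() {
    long long sum = 0;
    std::size_t seen = 0;
    while (seen < data.size()) {
        std::size_t avail = progress.load(std::memory_order_acquire);
        for (; seen < avail; ++seen)
            sum += data[seen];          // safe: below the counter
        if (seen < data.size())
            std::this_thread::yield();  // nothing new yet, back off
    }
    return sum;
}
```

This keeps both threads touching only the flag cache line plus the data itself, which is the kind of lightweight handoff the answer recommends over shipping huge buffers around.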

If memory bandwidth between threads is your biggest bottleneck, consider just doing the work in the producer thread (especially on the fly as the data is being written, rather than in separate passes), instead of using a separate thread at all. i.e., maybe one of the boundaries between threads is not in an ideal place in your design.
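
As a toy illustration of that point, compare a two-pass pipeline (write the whole buffer, then read it back, paying for a second trip through the memory hierarchy) with a fused loop that consumes each value while it is still in a register; the function names and sizes are invented for illustration:

```cpp
#include <vector>
#include <cstddef>

// Two passes: materialize the whole array, then reduce it. The second
// pass re-fetches data that may already have been evicted from cache.
double two_pass(std::size_t n) {
    std::vector<double> buf(n);
    for (std::size_t i = 0; i < n; ++i) buf[i] = i * 0.5;   // produce
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += buf[i];      // consume
    return sum;
}

// Fused: do the "consumer" work on the fly, so the intermediate value
// never round-trips through memory at all.
double fused(std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += i * 0.5;
    return sum;
}
```

Both compute the same result, but the fused version needs no buffer and no second pass; the same reasoning applies when the two passes live in different threads.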

Real life software design is complicated, and sometimes you end up having to choose between poor options.

Hardware design is complicated, too, with lots of tradeoffs. It appears that SKX's L3 cache + mesh does worse than the old ring-bus setup for medium core-count chips; presumably it's a win for the biggest chips on some kinds of workloads. Hopefully future generations will have better single-core latency / bandwidth.
