Does an x86_64 CPU use the same cache lines to communicate between 2 processes via shared memory?


Problem Description


As is known, all levels of cache (L1/L2/L3) on modern x86_64 are virtually indexed, physically tagged. And all cores communicate via the last-level cache (L3) by using a cache-coherence protocol (MOESI/MESIF) over QPI/HyperTransport.

For example, Sandy Bridge family CPUs have a 4- to 16-way L3 cache and a 4 KB page size; this allows data to be exchanged between concurrent processes running on different cores via shared memory. This is possible because the L3 cache can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time.
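
For concreteness, here is a minimal sketch of the setup the question describes, assuming POSIX shared memory; the object name "/demo_shm" is a made-up placeholder. Two processes running this map the same physical page, normally at different virtual addresses.

    /* Minimal sketch of the scenario: a process creates (or opens) a POSIX
     * shared-memory object and maps one page of it. A second process doing
     * the same gets its own virtual address for the same physical frame.
     * The object name "/demo_shm" is a placeholder for this example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

        /* The kernel picks the virtual address; it normally differs between
         * the two processes even though the physical page is the same. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("shared page mapped at virtual address %p\n", (void *)p);
        strcpy(p, "hello");        /* becomes visible to the other process */

        munmap(p, 4096);
        close(fd);
        return 0;
    }

On Linux the same program can simply be run twice (link with -lrt on older glibc).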

Does this mean that every time process 1 requests the same shared memory region, process 2 flushes its cache lines of that page into RAM, and then process 1 loads the same memory region as cache lines of a page in process 1's virtual address space? Is that really slow, or does the processor use some optimizations?

Does a modern x86_64 CPU use the same cache lines, without any flushes, to communicate between 2 processes with different virtual address spaces via shared memory?

Sandy Bridge Intel CPU - cache L3:

  • 8 MB - cache size
  • 64 B - cache line size
  • 128 K - lines (128 K = 8 MB / 64 B)
  • 16-way
  • 8 K - number of sets (8 K = 128 K lines / 16-way)
  • 13 bits [18:6] - of the virtual address (the index) define the current set number
  • 512 KB - virtual addresses that are equal modulo 512 KB (8 MB / 16-way) compete for the same set
  • low 19 bits - significant for determining the current set number

  • 4 KB - standard page size

  • only the low 12 bits - the same in the virtual and physical address

We have 7 missing bits [18:12] - i.e. we would need to check (2^7 * 16-way) = 2048 cache lines. This is the same as a 2048-way cache - so this would be very slow. Does this mean that cache L3 is (physically indexed, physically tagged)?
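
To make that arithmetic explicit, here is a small self-contained C sketch (plain integer math, no hardware access) that reproduces the geometry listed above and the 2^7 * 16 = 2048 figure:

    /* Worked calculation for the Sandy Bridge L3 geometry quoted above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long cache_size = 8UL * 1024 * 1024;  /* 8 MB          */
        unsigned long line_size  = 64;                 /* 64 B per line */
        unsigned long ways       = 16;                 /* 16-way        */
        unsigned long page_size  = 4096;               /* 4 KB pages    */

        unsigned long lines = cache_size / line_size;  /* 128 K lines   */
        unsigned long sets  = lines / ways;            /* 8 K sets      */

        unsigned long index_bits = 0, offset_bits = 0, page_bits = 0;
        for (unsigned long s = sets;      s > 1; s >>= 1) index_bits++;  /* 13 */
        for (unsigned long s = line_size; s > 1; s >>= 1) offset_bits++; /*  6 */
        for (unsigned long s = page_size; s > 1; s >>= 1) page_bits++;   /* 12 */

        /* Index bits are [18:6]; only bits [11:0] (the page offset) are equal
         * in the virtual and physical address, so bits [18:12] are unknown
         * before translation. */
        unsigned long missing    = (offset_bits + index_bits) - page_bits; /* 7    */
        unsigned long candidates = (1UL << missing) * ways;                /* 2048 */

        printf("lines=%lu sets=%lu index bits=[%lu:%lu] missing=%lu -> %lu candidate lines\n",
               lines, sets, offset_bits + index_bits - 1, offset_bits,
               missing, candidates);
        return 0;
    }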

Summary of missing bits in the virtual address for the index (page size 4 KB - 12 bits):

  • L3 (8 MB = 64 B x 128 K lines), 16-way, 8 K sets, 13 index bits [18:6] - missing 7 bits
  • L2 (256 KB = 64 B x 4 K lines), 8-way, 512 sets, 9 index bits [14:6] - missing 3 bits
  • L1 (32 KB = 64 B x 512 lines), 8-way, 64 sets, 6 index bits [11:6] - no missing bits

It should be:

  • L3 / L2 (physically indexed, physically tagged) used after TLB lookup
  • L1 (virtually indexed, physically tagged)

Solution

This is possible because the L3 cache can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time.

Huh what? If both processes have a page mapped, they can both hit in the cache for the same line of physical memory.

That's part of the benefit of Intel's multicore designs using large inclusive L3 caches. Coherency only requires checking L3 tags to find cache lines in E or M state in another core's L2 or L1 cache.

Getting data between two cores only requires writeback to L3. I forget where this is documented. Maybe http://agner.org/optimize/. CPUs before Nehalem that had separate caches for each core I think had to flush to DRAM for coherency. IDK if the data could be sent directly from cache to cache with the same protocol used to detect coherency issues.
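
As a software-side illustration of that point, here is a hedged sketch of the usual hand-off pattern (shown with two threads for brevity; two processes sharing a mapping behave the same way). Note that the code issues no cache flush at all: the coherency hardware moves the line between the cores' caches, and on an inclusive-L3 design the snoop is resolved through L3.

    /* Producer/consumer hand-off through a shared cache line with C11
     * atomics. There is no clflush or other explicit flush anywhere;
     * cache coherency makes the store visible to the other core.
     * Build with: cc -std=c11 -pthread handoff.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int        payload;      /* the data being handed over          */
    static atomic_int ready;        /* publication flag (zero-initialized) */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;                                            /* plain store */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* publish     */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                    /* spin        */
        printf("consumer saw payload = %d\n", payload);          /* prints 42   */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }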


The same cache line mapped to different virtual addresses will always go in the same set of the L1 cache. See the discussion in comments: L2 / L3 caches are physically indexed as well as physically tagged, so aliasing is never a problem. (Only L1 could get a speed benefit from virtual indexing. L1 cache misses aren't detected until after address translation is finished, so the physical address is ready in time to probe the higher-level caches.)

Also note that the discussion in comments incorrectly mentions Skylake lowering the associativity of L1 cache. In fact, it's the Skylake L2 cache that's less associative than before (4-way, down from 8-way in SnB/Haswell/Broadwell). L1 is still 32kiB 8-way as always: the maximum size for that associativity that keeps the page-selection address bits out of the index. So there's no mystery after all.
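
That size/associativity relationship can be checked with a couple of compile-time assertions; a minimal sketch, assuming the 32 KiB / 8-way / 64 B-line geometry stated above:

    /* Why a 32 KiB, 8-way, 64 B/line L1 is alias-free: each way spans
     * exactly one 4 KiB page, so the set-index bits [11:6] lie entirely
     * inside the page offset and are identical in the virtual and
     * physical address. Requires C11 for static_assert. */
    #include <assert.h>

    #define L1_SIZE    (32 * 1024)
    #define L1_WAYS    8
    #define LINE_SIZE  64
    #define PAGE_SIZE  4096

    #define WAY_SIZE   (L1_SIZE / L1_WAYS)      /* 4096 bytes */
    #define L1_SETS    (WAY_SIZE / LINE_SIZE)   /* 64 sets    */

    static_assert(WAY_SIZE == PAGE_SIZE,
                  "index bits come only from the page offset -> no aliasing");
    static_assert(L1_SETS == 64, "6 index bits: [11:6]");

    int main(void) { return 0; }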

Also see another answer to this question about HT threads on the same core communicating through L1. I said more about cache ways and sets there. (And thanks to Voo, I just corrected it to say that the cache index selects a set, not a way. :P)
