L1 caches usually have split design, but L2, L3 caches have unified design, why?

Question

I was reading this thread: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Based on my understanding, the primary advantage of the split design is that it lets us place the instruction cache close to the instruction-fetch unit and the data cache close to the memory unit, thereby reducing the latency of both at once. The primary disadvantage is that the combined space of the instruction and data caches may not be efficiently utilized: simulations have shown that a unified cache of the same total size has a higher hit rate.

However, I couldn't find an intuitive answer to the question: why do L1 caches (at least in most modern processors) follow the split design, while L2/L3 caches follow the unified design?

Solution

Most of the reason for split L1 is to distribute the necessary read/write ports (and thus bandwidth) across two caches, and to place each cache physically close to the part of the pipeline it serves: L1d near the load/store units, L1i near instruction fetch.

Also for L1d to handle byte loads/stores (and on some ISAs, unaligned wider loads/stores). On x86 CPUs which want to handle that with maximum efficiency (not an RMW of the containing word(s)), Intel's L1d uses only parity, not ECC. L1i only has to handle fixed-width fetches, often something simple like an aligned 16-byte chunk, and it's always "clean" because it's read-only, so it only needs to detect errors (not correct them) and re-fetch. So it can have less overhead per line of data, like only a couple of parity bits per 8 or 16 bytes.
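
To make the detect-vs-correct distinction concrete, here is a minimal software sketch of that scheme, with one even-parity bit per 8-byte chunk. The chunk size and encoding are illustrative assumptions; real cache arrays compute this in hardware alongside the data:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One even-parity bit per 64-bit chunk: enough to detect a single-bit
   flip, but not to locate or correct it. A read-only cache can respond
   to a mismatch by simply re-fetching the line. */
static bool parity64(uint64_t chunk) {
    /* XOR-fold 64 bits down to a single parity bit. */
    chunk ^= chunk >> 32;
    chunk ^= chunk >> 16;
    chunk ^= chunk >> 8;
    chunk ^= chunk >> 4;
    chunk ^= chunk >> 2;
    chunk ^= chunk >> 1;
    return chunk & 1;
}

int main(void) {
    uint64_t data = 0x0123456789abcdefULL;
    bool stored = parity64(data);   /* computed when the line is filled */
    data ^= 1ULL << 17;             /* simulate a single-bit upset */
    if (parity64(data) != stored)
        puts("corrupted: re-fetch the line");   /* detect, don't correct */
    return 0;
}
```

One bit per chunk is why the per-line overhead stays tiny; a cache that can hold dirty data has to correct in place instead, which typically means a wider SECDED-style ECC code per chunk.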

See Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: it being impossible to build one large unified L1 cache with twice the capacity, same latency, and sum total of the bandwidth as a split L1i/d. (At least prohibitively more expensive for power due to size and number of read/write ports, but potentially actually impossible for latency because of physical-distance reasons.)
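
As a back-of-envelope illustration of the bandwidth problem (the numbers are typical of recent x86 cores, not any specific model): a core might fetch 16 bytes of code per cycle from L1i while simultaneously doing two 32-byte loads plus a 32-byte store against L1d. A split design serves that with one read port on the L1i array and dedicated load/store ports on the L1d array; a single unified L1 would need all of those ports on one array of twice the capacity. Multi-ported SRAM grows quickly in area and power with each added port, and the bigger, physically farther-away array would also push up the load-use latency that L1d is most heavily optimized for.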

None of those factors are important for L2 (or exist at all in the case of unaligned / byte stores). Total capacity that can be used for code or data is most useful there, competitively shared based on demand.

It would be very rare for any workload to have lots of L1i and L1d misses in the same clock cycle, because frequent code misses mean the front end stalls, and the back-end will run out of load/store instructions to execute. (Frequent L1i misses are rare, but frequent L1d misses do happen in some normal workloads, e.g. looping over an array that doesn't fit in L1d, or a large hash table or other more scattered access pattern.) Anyway, this means data can get most of the total L2 bandwidth budget under normal conditions, and a unified L2 still only needs 1 read port.
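
For a concrete version of that data-bound case, here is a minimal sketch (assuming a typical ~32 KiB L1d; the perf event names in the comment vary by CPU and are only an example):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A tiny hot loop: its code fits easily in L1i, while its 8 MiB of data
   is far larger than a typical 32 KiB L1d, so the data side misses
   steadily and those misses are served by the unified L2/L3.
   One rough way to observe the asymmetry on Linux:
     perf stat -e L1-dcache-load-misses,L1-icache-load-misses ./a.out */
enum { N = 8 * 1024 * 1024 / sizeof(uint64_t) };

int main(void) {
    uint64_t *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = i;   /* touch every line */

    uint64_t sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (size_t i = 0; i < N; i++)
            sum += a[i];            /* data misses dominate; code stays hot */

    printf("%llu\n", (unsigned long long)sum);
    free(a);
    return 0;
}
```

While the data side misses on nearly every line, the front end keeps hitting in L1i, so the unified L2 can spend essentially all of its single read port's bandwidth on data, which is exactly the situation described above.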

@Hadi's answer that you linked does cover most of these reasons, but I guess it doesn't hurt to write a simplified / summary answer.
