How do cache lines work?


Problem description


I understand that the processor brings data into the cache via cache lines; on my Atom processor, for instance, about 64 bytes are brought in at a time, whatever the size of the data actually being read.

My question is:

Imagine that you need to read one byte from memory: which 64 bytes will be brought into the cache?

The two possibilities I can see are that either the 64 bytes start at the closest 64-byte boundary below the byte of interest, or the 64 bytes are spread around the byte in some predetermined way (for instance, half below, half above, or all above).

Which is it?

Solution

If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address below the one you need that is a multiple of 64).

Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer sizes are configurable up to 64B; CPUs select the burst transfer size to match their cache line size, but 64B is common.)

As a rule of thumb, if the processor can't predict a memory access (and prefetch it), the retrieval can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving the data).

By contrast, a hit in L1 cache has a load-use latency of 3 or 4 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.

Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the tag wiki has lots of performance links for that architecture.

