Why does instruction cache alignment improve performance in set associative cache implementations?


Question

I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.

I understand the concept of cache hits and their importance in computing speed.

But it seems that in set associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code-block boundary, the CPU should still get a cache hit, since the adjacent block has not been evicted by the execution of the previous block. Both blocks are likely to remain cached during the loop.

So all I can figure is that if there is truth in the claim that alignment helps, it must come from some other effect.

Is there a cost in switching cache lines?

Is there a difference between a cache hit on a new line and a hit on the same cache line you're currently reading from?

Answer

Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.
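As a rough illustration (a hypothetical x86-64 GAS fragment; the function and labels are my own, not from the answer): laying the rarely-taken path out after the hot path keeps the fast path contiguous, so it touches fewer I-cache lines.

        # x86-64 AT&T syntax. Keep the fast path contiguous; the rare
        # error path is placed after the ret so it doesn't share
        # instruction-fetch lines with the hot code.
        .text
        .globl  lookup
     lookup:                     # long lookup(const long *p)
        testq   %rdi, %rdi
        jz      .Lnull           # rarely taken: forward branch out of the hot path
        movq    (%rdi), %rax     # hot path: a handful of instructions,
        ret                      # ideally within one or two 64-byte lines
     .Lnull:
        xorl    %eax, %eax       # cold path lives out of line
        ret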

Accessing any line that's present in L1i cache takes the same amount of time. (Unless your cache uses way-prediction: that introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache with "branch target" vs. "sequential" ways in its 2-way associative 64kiB L1i. Or see any of the academic papers that come up when you google "cache way prediction" like I did.)

Other than that, the effects aren't so much about cache-line boundaries but rather about aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects don't come from the things you were considering.

See Modern Microprocessors A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.

Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Let's simplify by considering a RISC ISA with a 4-byte instruction width¹ and 4-wide fetch/decode/exec. (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).

   ...
 .top_of_loop:
    insn1                ; at address 16*n + 12
      ; 16-byte boundary here
    insn2                ; at address 16*(n+1) + 0
    insn3                ; at address 16*(n+1) + 4
    b  .top_of_loop      ; at address 16*(n+1) + 8

    ... after loop       ; at address 16*(n+1) + 12
    ... after loop       ; at address 16*(n+2) + 0

Without any kind of loop buffer, the fetch stage has to fetch the loop instructions from I-cache once for every time the loop executes. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks. It's not capable of fetching the 16 bytes of instructions in one unaligned fetch.

But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.

   ...
    nop                  ; at address 16*n + 12   ; NOP padding for alignment
 .top_of_loop:           ; 16-byte boundary here
    insn1                ; at address 16*(n+1) + 0
    insn2                ; at address 16*(n+1) + 4
    insn3                ; at address 16*(n+1) + 8
    b  .top_of_loop      ; at address 16*(n+1) + 12

    ... after loop       ; at address 16*(n+2) + 0
    ... after loop       ; at address 16*(n+2) + 4

With a larger loop that's not a multiple of 4 instructions, there's still going to be a partially-wasted fetch somewhere. It's generally best that it's not at the top of the loop, though, as the sketch below shows. Getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction fetch.
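For example, continuing the same made-up 4-wide machine: a 6-instruction loop aligned at a 16-byte boundary still needs two fetch cycles per iteration, but the half-empty fetch block is the second one, after four instructions are already in the pipeline.

   ...
 .top_of_loop:           ; 16-byte boundary here
    insn1                ; at address 16*n + 0
    insn2                ; at address 16*n + 4      } first fetch block:
    insn3                ; at address 16*n + 8      } 4 useful instructions
    insn4                ; at address 16*n + 12
    insn5                ; at address 16*(n+1) + 0  } second fetch block:
    b  .top_of_loop      ; at address 16*(n+1) + 4  } only 2 of 4 slots useful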

In general, aligning branch targets (including function entry points) to a multiple of 16 can be a win (at the cost of greater I-cache pressure from lower code density). A useful tradeoff is padding to the next multiple of 16 only if you're within 1 or 2 instructions of it, so that in the worst case a fetch block contains at least 2 or 3 useful instructions, not just 1.

This is why the GNU assembler supports .p2align 4,,8: pad to the next 2^4 (16-byte) boundary, but only if it's 8 bytes away or closer. GCC does in fact emit that directive for some targets / architectures, depending on tuning options / defaults.
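For instance, a minimal sketch of the directive in use (a hypothetical x86-64 loop; only the .p2align directive itself comes from the answer):

        # .p2align 4,,8 pads with NOPs to the next 2^4 = 16-byte boundary,
        # but only if at most 8 padding bytes are needed; otherwise it
        # emits nothing and leaves the label where it falls.
        .text
        .globl  sum_array
     sum_array:                  # long sum_array(const long *p, long n)
        xorl    %eax, %eax       # sum = 0
        testq   %rsi, %rsi
        jle     .Ldone
        .p2align 4,,8            # cheap alignment for the loop top
     .Ltop:
        addq    (%rdi), %rax     # sum += *p
        addq    $8, %rdi         # ++p
        decq    %rsi             # --n
        jnz     .Ltop
     .Ldone:
        ret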

In the general case for non-loop branches, you also don't want the branch target to be near the end of a cache line; you might get another I-cache miss right away fetching the next line.

Footnote 1:

The principle also applies to modern x86 with its variable-width instructions, at least when decoded-uop cache misses force them to actually fetch x86 machine code from L1I cache. And it applies to older superscalar x86 like Pentium III or K8 without uop caches or loop buffers (which can make loops efficient regardless of alignment).

But x86 decoding is so hard that it takes multiple pipeline stages, e.g. some stages just to find instruction boundaries and then feed groups of instructions to the decoders. Only the initial fetch blocks are aligned, and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.

https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer, leading to the full decode stage which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops.

Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.

Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.
