ARM prefetch workaround

Published: 2021/11/17 22:25:46. Tags: assembly, gcc, arm, armv6, speculative-execution

Question

I have a situation where part of the address space is sensitive: if you read it, you crash, because nothing is there to respond to that address.

pop {r3,pc}
bx r0

   0:   e8bd8008    pop {r3, pc}
   4:   e12fff10    bx  r0

   8:   bd08        pop {r3, pc}
   a:   4700        bx  r0

The bx was not created by the compiler as an instruction. It is the result of a 32-bit constant that didn't fit as an immediate in a single instruction, so a pc-relative load is set up instead. This is basically the literal pool, and it happens to contain bits that resemble a bx.

It is easy to write a test program that reproduces the issue.

unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
    return(more_fun(0x12344700)+1);
}

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   4802        ldr r0, [pc, #8]    ; (c <fun+0xc>)
   4:   f7ff fffe   bl  0 <more_fun>
   8:   3001        adds    r0, #1
   a:   bd10        pop {r4, pc}
   c:   12344700    eorsne  r4, r4, #0, 14

What appears to be happening is that the processor, while waiting on the data coming back from the pop (ldm), moves on to the next instruction (bx r0 in this case) and starts a prefetch at the address in r0, which hangs the ARM.

As humans we see the pop as an unconditional branch, but the processor does not; it keeps going through the pipe.

Prefetching and branch prediction are nothing new (we have the branch predictor off in this case). They are decades old and not limited to ARM, but few instruction sets have the PC as a GPR along with instructions that, to some extent, treat it as non-special.

I am looking for a gcc command-line option to prevent this. I can't imagine we are the first ones to see this.

I can of course do this:

-march=armv4t


00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   4803        ldr r0, [pc, #12]   ; (10 <fun+0x10>)
   4:   f7ff fffe   bl  0 <more_fun>
   8:   3001        adds    r0, #1
   a:   bc10        pop {r4}
   c:   bc02        pop {r1}
   e:   4708        bx  r1
  10:   12344700    eorsne  r4, r4, #0, 14

which prevents the problem.

Note that this is not limited to thumb mode: gcc can produce arm code for something like this as well, with the literal pool right after the pop.

unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
    return(more_fun(0xe12fff10)+1);
}

00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   e59f0008    ldr r0, [pc, #8]    ; 14 <fun+0x14>
   8:   ebfffffe    bl  0 <more_fun>
   c:   e2800001    add r0, r0, #1
  10:   e8bd8010    pop {r4, pc}
  14:   e12fff10    bx  r0

I am hoping someone knows a generic or ARM-specific option that does an armv4t-like return (pop {r4,lr}; bx lr in arm mode, for example) without the baggage, or that puts a branch-to-self immediately after a pop pc. The latter seems to solve the problem: the pipe is not confused about b being an unconditional branch.

EDIT

ldr pc,[something]
bx rn

also causes a prefetch, and that is not going to be covered by -march=armv4t. gcc intentionally generates ldrls pc,[]; b somewhere for switch statements, and that is fine. I didn't inspect the backend to see whether other ldr pc,[] instructions are generated.

EDIT

It looks like ARM did report this as an erratum (erratum 720247, "Speculative Instruction fetches can be made anywhere in the memory map"). I wish I had known that before we spent a month on it...

Recommended answer

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html documents a -mpure-code option, which doesn't put constants in code sections. "This option is only available when generating non-pic code for M-profile targets with the MOVT instruction." So it probably loads constants with a pair of mov-immediate instructions instead of from a constant pool.

This doesn't fully solve your problem though, since speculative execution of regular instructions (after a conditional branch inside a function) with bogus register contents could still trigger access to unpredictable addresses. Or just the first instruction of another function might be a load, so falling through into another function isn't always safe either.

I can try to shed some light on why this is obscure enough that compilers don't already avoid it.

Normally, speculative execution of instructions that fault is not a problem. The CPU doesn't actually take the fault until it becomes non-speculative. Incorrect (or non-existent) branch prediction can make the CPU do something slow before figuring out the right path, but there should never be a correctness problem.

Normally, speculative loads from memory are allowed in most CPU designs. But memory regions with MMIO registers obviously have to be protected from this. In x86, for example, memory regions can be WB (normal, write-back cacheable, speculative loads allowed) or UC (uncacheable, no speculative loads), not to mention write-combining and write-through...

You probably need something similar to solve your correctness problem, to stop speculative execution from doing something that will actually explode. This includes speculative instruction fetch triggered by a speculative bx r0. (Sorry, I don't know ARM, so I can't suggest how you'd do that. But this is why it's only a minor performance problem for most systems, even though they have MMIO registers that can't be speculatively read.)

I think it's very unusual to have a setup that lets the CPU do speculative loads from addresses that crash the system instead of just raising an exception when / if they become non-speculative.

Quoting the question: "we have the branch predictor off in this case"

This may be why you're always seeing speculative execution beyond an unconditional branch (the pop), instead of just very rarely.

Nice detective work with using a bx to return, showing that your CPU detects that kind of unconditional branch at decode, but doesn't check the pc bit in a pop. :/

In general, branch prediction has to happen before decode, to avoid fetch bubbles. Given the address of a fetch block, predict the next block-fetch address. Predictions are also generated at the instruction level instead of fetch-block level, for use by later stages of the core (because there can be multiple branch instructions in a block, and you need to know which one is taken).

That's the general theory. Branch prediction is not 100% accurate, so you can't count on it to solve correctness problems.

x86 CPUs can have performance problems where the default prediction for an indirect jmp [mem] or jmp reg is the next instruction. If speculative execution starts something that's slow to cancel (like div on some CPUs) or triggers a slow speculative memory access or TLB miss, it can delay execution of the correct path once it's determined.

So it's recommended (by optimization manuals) to put ud2 (illegal instruction) or int3 (debug trap) or similar after a jmp reg. Or better, put one of the jump-table destinations there so "fall-through" is a correct prediction some of the time. (If the BTB doesn't have a prediction, next-instruction is about the only sane thing it can do.)

x86 doesn't normally mix code with data, though, so this is more likely to be a problem for architectures where literal pools are common. (But loads from bogus addresses can still happen speculatively after indirect branches, or mispredicted normal branches.)

For example, if(address_good) { call table[address](); } could easily mispredict and trigger speculative code fetch from a bad address. But if the eventual physical address range is marked uncacheable, the load request would stop in the memory controller until it was known to be non-speculative.

A return instruction is a type of indirect branch, but it's less likely that a next-instruction prediction is useful. So maybe bx lr stalls because speculative fall-through is less likely to be useful?

pop {pc} (aka LDMIA from the stack pointer) is either not detected as a branch in the decode stage (if it doesn't specifically check the pc bit), or it's treated as a generic indirect branch. There are certainly other use cases for a load into pc as a non-return branch, so detecting it as a probable return would require checking the source register encoding as well as the pc bit.

Maybe there's a special (internal hidden) return-address predictor stack that helps get bx lr predicted correctly every time, when paired with bl? x86 does this, to predict call/ret instructions.

Have you tested whether pop {r4, pc} is more efficient than pop {r4, lr} / bx lr? If bx lr is handled specially in more ways than just avoiding speculative execution of garbage, it might be better to get gcc to do that, instead of having it lead its literal pool with a b instruction or something.
