ARM prefetch workaround


Question

I have a situation where part of the address space is sensitive: if you read it, you crash, because nothing is there to respond to that address.

pop {r3,pc}
bx r0

   0:   e8bd8008    pop {r3, pc}
   4:   e12fff10    bx  r0

   8:   bd08        pop {r3, pc}
   a:   4700        bx  r0

The bx was not created by the compiler as an instruction. It is the result of a 32-bit constant that didn't fit as an immediate in a single instruction, so a pc-relative load is set up. This is basically the literal pool, and it happens to have bits that resemble a bx.
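To make the "bits that resemble a bx" point concrete, here is a minimal sketch that tests whether a 32-bit literal would decode as BX in either instruction set. The function names are hypothetical, and the masks are taken from the standard Thumb and ARM BX encodings (the ARM check ignores which condition code is in the top four bits).

```c
#include <stdbool.h>
#include <stdint.h>

/* Thumb BX Rm is 0100 0111 0 Rm(4 bits) 000, so (hw & 0xFF87) == 0x4700. */
bool is_thumb_bx(uint16_t hw)
{
    return (hw & 0xFF87u) == 0x4700u;
}

/* ARM BX Rm is cond 0001 0010 1111 1111 1111 0001 Rm.
   The condition field is masked off, so any condition matches. */
bool is_arm_bx(uint32_t w)
{
    return (w & 0x0FFFFFF0u) == 0x012FFF10u;
}

/* True if a literal-pool word would decode as BX in ARM state, or if
   either of its halfwords would decode as BX in Thumb state. In a
   little-endian image the low halfword sits at the lower address. */
bool literal_looks_like_bx(uint32_t w)
{
    return is_arm_bx(w) ||
           is_thumb_bx((uint16_t)(w & 0xFFFFu)) ||
           is_thumb_bx((uint16_t)(w >> 16));
}
```

With these definitions, both constants used in this question (0x12344700 for thumb and 0xe12fff10 for arm) are flagged.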

It is easy to write a test program that generates the issue.

unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
    return(more_fun(0x12344700)+1);
}

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   4802        ldr r0, [pc, #8]    ; (c <fun+0xc>)
   4:   f7ff fffe   bl  0 <more_fun>
   8:   3001        adds    r0, #1
   a:   bd10        pop {r4, pc}
   c:   12344700    eorsne  r4, r4, #0, 14

What appears to be happening is that the processor, while waiting on data coming back from the pop (ldm), moves on to the next instruction, bx r0 in this case, and starts a prefetch at the address in r0, which hangs the ARM.
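The pattern described above (a pop that loads pc, immediately followed by bits that decode as bx) can be searched for mechanically. The following is a rough sketch, not a real disassembler: it walks a buffer of 16-bit Thumb halfwords and reports the first such pair. The function names are hypothetical, and the scan is deliberately naive: it treats every halfword as a separate 16-bit instruction, so the second half of a 32-bit Thumb-2 encoding can produce false positives, and it misses a literal pool that is separated from the pop by an alignment padding halfword.

```c
#include <stddef.h>
#include <stdint.h>

/* Thumb POP with the PC bit set is 1011 1101 reglist, i.e. 0xBDxx. */
int is_thumb_pop_pc(uint16_t hw)
{
    return (hw & 0xFF00u) == 0xBD00u;
}

/* Thumb BX Rm: (hw & 0xFF87) == 0x4700. */
int looks_like_thumb_bx(uint16_t hw)
{
    return (hw & 0xFF87u) == 0x4700u;
}

/* Return the index of the first pop {..., pc} that is immediately
   followed by a halfword that decodes as BX, or -1 if there is none. */
long find_pop_pc_then_bx(const uint16_t *code, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (is_thumb_pop_pc(code[i]) && looks_like_thumb_bx(code[i + 1]))
            return (long)i;
    }
    return -1;
}
```

Fed the halfwords of the fun listing above, this reports the pop at offset 0xa (index 5), because the literal's low halfword 0x4700 follows it directly.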

As humans we see the pop as an unconditional branch, but the processor does not; it keeps going through the pipe.

Prefetching and branch prediction are nothing new (we have the branch predictor off in this case). They are decades old and not limited to ARM, but few instruction sets have the PC as a GPR and instructions that to some extent treat it as non-special.

I am looking for a gcc command line option to prevent this. I can't imagine we are the first ones to see this.

I can of course do this:

-march=armv4t


00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   4803        ldr r0, [pc, #12]   ; (10 <fun+0x10>)
   4:   f7ff fffe   bl  0 <more_fun>
   8:   3001        adds    r0, #1
   a:   bc10        pop {r4}
   c:   bc02        pop {r1}
   e:   4708        bx  r1
  10:   12344700    eorsne  r4, r4, #0, 14

This prevents the problem.

Note that this is not limited to thumb mode: gcc can also produce arm code with the literal pool right after the pop for something like this.

unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
    return(more_fun(0xe12fff10)+1);
}

00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   e59f0008    ldr r0, [pc, #8]    ; 14 <fun+0x14>
   8:   ebfffffe    bl  0 <more_fun>
   c:   e2800001    add r0, r0, #1
  10:   e8bd8010    pop {r4, pc}
  14:   e12fff10    bx  r0

I am hoping someone knows a generic or ARM-specific option that either does an armv4t-like return (pop {r4,lr}; bx lr in arm mode, for example) without the baggage, or puts a branch-to-self immediately after a pop pc (that seems to solve the problem; the pipe is not confused about b being an unconditional branch).

EDIT

ldr pc,[something]
bx rn

also causes a prefetch, and that is not going to fall under -march=armv4t. gcc intentionally generates ldrls pc,[]; b somewhere for switch statements, and that is fine. I didn't inspect the backend to see whether other ldr pc,[] instructions are generated.

EDIT

It looks like ARM did report this as an erratum (erratum 720247, "Speculative Instruction fetches can be made anywhere in the memory map"). I wish I had known that before we spent a month on it...

Answer

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html documents a -mpure-code option, which doesn't put constants in code sections. "This option is only available when generating non-pic code for M-profile targets with the MOVT instruction." So it probably loads constants with a pair of mov-immediate instructions instead of from a constant pool.
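As an illustration of the "pair of mov-immediate instructions" idea: MOVW writes a 16-bit immediate into the low half of a register and zeroes the top half, and MOVT then writes a 16-bit immediate into the top half while leaving the low half alone. The sketch below only models that arithmetic in C; the helper names are hypothetical, and whether gcc emits exactly this pair for a given constant under -mpure-code is an assumption, as in the answer itself.

```c
#include <stdint.h>

/* The 16-bit immediate a MOVW would carry: the low half of the constant. */
uint16_t movw_imm(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

/* The 16-bit immediate a MOVT would carry: the high half of the constant. */
uint16_t movt_imm(uint32_t value)
{
    return (uint16_t)(value >> 16);
}

/* Model of the register after MOVW lo followed by MOVT hi. */
uint32_t movw_then_movt(uint16_t lo, uint16_t hi)
{
    uint32_t reg = lo;              /* MOVW: low half set, top half zeroed */
    reg |= (uint32_t)hi << 16;      /* MOVT: top half set, low half kept   */
    return reg;
}
```

For the constant 0x12344700 from the question, the two immediates would be 0x4700 and 0x1234, and no literal-pool word is needed after the pop.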

This doesn't fully solve your problem, though, since speculative execution of regular instructions (after a conditional branch inside a function) with bogus register contents could still trigger accesses to unpredictable addresses. Or the first instruction of another function might be a load, so falling through into another function isn't always safe either.

I can try to shed some light on why this is obscure enough that compilers don't already avoid it.

Normally, speculative execution of instructions that fault is not a problem. The CPU doesn't actually take the fault until it becomes non-speculative. Incorrect (or non-existent) branch prediction can make the CPU do something slow before figuring out the right path, but there should never be a correctness problem.

Normally, speculative loads from memory are allowed in most CPU designs. But memory regions with MMIO registers obviously have to be protected from this. In x86, for example, memory regions can be WB (normal, write-back cacheable, speculative loads allowed) or UC (uncacheable, no speculative loads). Not to mention write-combining, write-through...

You probably need something similar to solve your correctness problem, to stop speculative execution from doing something that will actually explode. This includes speculative instruction fetch triggered by a speculative bx r0. (Sorry, I don't know ARM, so I can't suggest how you'd do that. But this is why it's only a minor performance problem for most systems, even though they have MMIO registers that can't be speculatively read.)

I think it's very unusual to have a setup that lets the CPU do speculative loads from addresses that crash the system, instead of just raising an exception when (and if) they become non-speculative.

"we have the branch predictor off in this case"

This may be why you're always seeing speculative execution beyond an unconditional branch (the pop), instead of just very rarely.

Nice detective work using a bx to return, which shows that your CPU detects that kind of unconditional branch at decode but doesn't check the pc bit in a pop. :/

In general, branch prediction has to happen before decode, to avoid fetch bubbles: given the address of a fetch block, predict the next block-fetch address. Predictions are also generated at the instruction level instead of the fetch-block level, for use by later stages of the core (because there can be multiple branch instructions in a block, and you need to know which one is taken).

That's the general theory. Branch prediction isn't 100% accurate, so you can't count on it to solve your correctness problem.

x86 CPUs can have performance problems where the default prediction for an indirect jmp [mem] or jmp reg is the next instruction. If speculative execution starts something that's slow to cancel (like div on some CPUs), or triggers a slow speculative memory access or TLB miss, it can delay execution of the correct path once it's determined.

So it's recommended (by optimization manuals) to put ud2 (illegal instruction) or int3 (debug trap) or similar after a jmp reg. Or better, put one of the jump-table destinations there, so "fall-through" is a correct prediction some of the time. (If the BTB doesn't have a prediction, next-instruction is about the only sane thing it can do.)
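The fall-through behaviour on a BTB miss can be sketched with a toy model. This is a simplified illustration under my own assumptions (a 16-entry direct-mapped table indexed by the branch address, with hypothetical names), not a description of any real core's predictor.

```c
#include <stdint.h>

#define BTB_ENTRIES 16u

struct btb_entry {
    uint32_t tag;     /* address of the branch this entry belongs to */
    uint32_t target;  /* where that branch went last time            */
    int      valid;
};

struct btb {
    struct btb_entry e[BTB_ENTRIES];
};

/* Record that the branch at pc was last seen going to target. */
void btb_update(struct btb *b, uint32_t pc, uint32_t target)
{
    struct btb_entry *e = &b->e[(pc >> 2) % BTB_ENTRIES];
    e->tag = pc;
    e->target = target;
    e->valid = 1;
}

/* Predict the next fetch address. On a hit, use the stored target.
   On a miss (or an aliasing entry for another branch), fall through
   to the next sequential instruction. */
uint32_t btb_predict(const struct btb *b, uint32_t pc, uint32_t insn_size)
{
    const struct btb_entry *e = &b->e[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc)
        return e->target;
    return pc + insn_size;
}
```

In this model, whatever sits right after an indirect branch is what gets fetched the first time that branch is seen, which is why the bytes placed there matter.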

x86 doesn't normally mix code with data, though, so this is more likely to be a problem for architectures where literal pools are common. (But loads from bogus addresses can still happen speculatively after indirect branches, or after mispredicted normal branches.

e.g. if(address_good) { call table[address](); } could easily mispredict and trigger speculative code fetch from a bad address. But if the eventual physical address range is marked uncacheable, the load request would stop in the memory controller until it was known to be non-speculative.)
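Spelled out as compilable C, the pattern in that example might look like the sketch below. The table, handlers, and dispatch function are hypothetical names chosen for illustration. The code is architecturally safe as written; the comments mark where a mispredicted check could still let the core start the table load and the indirect fetch speculatively.

```c
#include <stddef.h>

typedef int (*handler_fn)(void);

static int handler_a(void) { return 1; }
static int handler_b(void) { return 2; }

#define TABLE_SIZE 2u

static const handler_fn table[TABLE_SIZE] = { handler_a, handler_b };

/* Call table[index] if index is in range, otherwise return -1. */
int dispatch(size_t index)
{
    if (index < TABLE_SIZE) {
        /* If the branch above is predicted taken for an out-of-range
           index, the core may speculatively load table[index] and
           begin fetching code from whatever value it finds there. */
        return table[index]();
    }
    return -1;
}
```

Nothing here prevents the speculation; it only shows which source-level construct produces it.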

A return instruction is a type of indirect branch, but a next-instruction prediction is less likely to be useful for it. So maybe bx lr stalls because speculative fall-through is unlikely to be useful?

pop {pc} (aka LDMIA from the stack pointer) is either not detected as a branch in the decode stage (if it doesn't specifically check the pc bit), or it's treated as a generic indirect branch. There are certainly other use cases for a load into pc as a non-return branch, so detecting it as a probable return would require checking the source register encoding as well as the pc bit.

Maybe there's a special (internal, hidden) return-address predictor stack that helps get bx lr predicted correctly every time when it is paired with bl? x86 does this to predict call/ret instructions.
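For readers unfamiliar with the idea, a return-address predictor stack can be modelled in a few lines. This is a generic toy model with hypothetical names and an assumed depth of 8; whether the core in question has such a structure at all is exactly the open question above.

```c
#include <stdint.h>

#define RAS_DEPTH 8u

struct ras {
    uint32_t slot[RAS_DEPTH];
    unsigned top;   /* number of outstanding pushes; storage wraps around */
};

/* On a call (bl): remember the address of the instruction after it. */
void ras_on_call(struct ras *r, uint32_t return_addr)
{
    r->slot[r->top % RAS_DEPTH] = return_addr;
    r->top++;
}

/* On a return: predict the most recently pushed address.
   Returns 0 when the stack is empty (no prediction available).
   After more than RAS_DEPTH nested calls the oldest entries have
   been overwritten, so deep returns would be mispredicted. */
uint32_t ras_predict_return(struct ras *r)
{
    if (r->top == 0)
        return 0;
    r->top--;
    return r->slot[r->top % RAS_DEPTH];
}
```

As long as calls and returns are properly nested and the nesting stays within the depth, every return is predicted correctly without consulting the bytes that follow it.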

Have you tested whether pop {r4, pc} is more efficient than pop {r4, lr} / bx lr? If bx lr is handled specially in more ways than just avoiding speculative execution of garbage, it might be better to get gcc to do that, instead of having it lead its literal pool with a b instruction or something.
