repz ret: why all the hassle?

Views: 146 | Published: 2020/6/5 18:35:19 | Tags: assembly x86 micro-optimization amd-processor branch-prediction

Question

The issue of repz ret has been covered here [1] as well as in other sources [2, 3 (http://pages.cs.wisc.edu/~lena/repzret.php)] quite satisfactorily. However, in none of these sources did I find answers to the following:

What is the actual penalty in a quantitative comparison with ret or nop; ret? Especially in the latter case – is decoding one extra instruction (and an empty one at that!) really relevant, when most functions either have 100+ of those or get inlined?

Why did this never get fixed in AMD K8, and even made its way into K10? Since when is documenting an ugly workaround based on a behaviour that is and stays undocumented preferred to actually fixing the issue, when every detail of the cause is known?

Answer

Branch misprediction
The reason for all the hoopla is the cost of branch mispredictions.
When the CPU comes across a branch, it predicts the branch taken and preloads the predicted instructions into the pipeline.
If the prediction is wrong, the pipeline needs to be cleared and new instructions loaded.
This can take up to number_of_stages_in_pipeline cycles, plus any cycles needed to load the data from the cache. 14 to 25 cycles per misprediction is typical.
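
To put a rough number on that, here is a back-of-the-envelope sketch. It only multiplies the 14 to 25 cycle range quoted above by a count of returns; the function name, the call count, and the assumption that every return is mispredicted are mine, chosen as a worst case for illustration, not measured values.

```python
def misprediction_overhead(branches, miss_rate, penalty_cycles):
    """Expected number of cycles lost to mispredicted branches."""
    return branches * miss_rate * penalty_cycles


# Hypothetical workload: one million returns, all of them mispredicted.
calls = 1_000_000
low = misprediction_overhead(calls, 1.0, 14)   # cheapest penalty quoted above
high = misprediction_overhead(calls, 1.0, 25)  # most expensive penalty quoted above

print(f"cycles lost: {low:,.0f} to {high:,.0f}")
```

Compared with that, one extra prefix byte on the ret costs next to nothing, which is why the workaround can pay off even in functions that are otherwise long.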

Reason: processor design
K8 and K10 suffer from this because of a nifty optimization by AMD.
AMD K8 and K10 pre-decode instructions in the cache and keep track of their length in the CPU L1 instruction cache.
To do this, the cache stores extra bits alongside the instructions.

For every 128 bits (16 bytes) of instructions, 76 bits of additional data are stored.

The following table details this:

Data             Size       Notes
-------------------------------------------------------------------------
Instructions     128 bits   The data as read from memory
Parity bits      8 bits     One parity bit for every 16 bits
Pre-decode       52 bits    3 bits per byte (start, end, function) 
                            + 4 bits per 16-byte line
Branch selectors 16 bits    2 bits for each 2 bytes of instruction code

Total            204 bits   128 instruction bits + 76 metadata bits
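
Tallying the per-line overhead from the Notes column of the table gives the 76 bits quoted above. This is plain arithmetic on the figures in the table; the variable names are just for illustration.

```python
LINE_BYTES = 16  # one 16-byte line of instruction code

instruction_bits = LINE_BYTES * 8              # the code as read from memory
parity_bits = instruction_bits // 16           # one parity bit per 16 bits
predecode_bits = 3 * LINE_BYTES + 4            # 3 bits per byte + 4 bits per line
branch_selector_bits = 2 * (LINE_BYTES // 2)   # 2 bits per 2 bytes of code

metadata_bits = parity_bits + predecode_bits + branch_selector_bits
total_bits = instruction_bits + metadata_bits

print(metadata_bits, total_bits)
```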

Because all this data is stored in the L1 instruction cache, the K8/K10 CPU has to do far less work on decode and branch prediction. This saves on silicon.
And because AMD does not have as big a transistor budget as Intel, it needs to work smarter.

However, if the code is especially tight, a jump and a ret might occupy the same two-byte slot, meaning that the RET gets predicted as NOT taken (because the jump following it is).
By making the RET occupy two bytes (REP RET), this can never occur and a RET will always be predicted OK.

Intel does not have this problem, but suffers (or used to suffer) from a limited number of prediction slots, which AMD does not.

nop ret
There is never a reason to do nop ret. That is two instructions, wasting an extra cycle to execute the nop, and the ret might still 'pair' with a jump.
If you want to align, use a REP MOV instead or use a multibyte nop.
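
For the multibyte nop route, a minimal sketch of how padding could be chosen. The byte sequences are the multi-byte NOP encodings recommended in Intel's instruction set reference; the table name, the helper name and the 16-byte default alignment are my own assumptions, and whether a particular CPU executes these encodings efficiently is not something this sketch checks.

```python
# Multi-byte NOP encodings, keyed by length in bytes.
NOPS = {
    1: bytes([0x90]),
    2: bytes([0x66, 0x90]),
    3: bytes([0x0F, 0x1F, 0x00]),
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}


def alignment_padding(offset, alignment=16):
    """Return the NOPs needed to pad from `offset` up to the next boundary.

    Uses as few instructions as possible by always taking the longest
    NOP that still fits in the remaining gap.
    """
    gap = -offset % alignment  # bytes missing until the next boundary
    padding = []
    while gap:
        length = min(gap, max(NOPS))
        padding.append(NOPS[length])
        gap -= length
    return padding


# Code ending at offset 13 needs 3 bytes of padding: a single 3-byte NOP.
print([nop.hex() for nop in alignment_padding(13)])
```

The point is that a 3-byte gap is filled by one instruction rather than three single-byte nops, so there is less to decode and execute.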

Closing remarks
Only the local branch prediction is stored with instructions in the cache.
There is a separate Global branch prediction table as well.
