Is CMOVcc considered a branching instruction?

Views: 176. Posted: 2020/10/11 0:09:41. Tags: assembly, x86-64, cpu-architecture, micro-optimization, branch-prediction


Question

I have this memchr code that I'm trying to make non-branching:

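.globl memchr
memchr:                         # args: s in %rdi, c in %esi, n in %rdx
        mov %rdx, %rcx          # rcx = n, the repeat count for repne
        mov %sil, %al           # al = byte to search for
        cld                     # DF = 0 so scasb walks upward (the SysV ABI already guarantees this on entry)
        repne scasb             # compare al with (%rdi), rdi++, rcx--, until a match or rcx == 0
        lea -1(%rdi), %rax      # candidate result: address of the last byte compared
        test %rcx, %rcx         # ZF = (rcx == 0); note rcx is also 0 when the match is the final byte
        cmove %rcx, %rax        # if rcx == 0, return rcx (NULL)
        ret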

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

Answer

No, it's not a branch; that's the whole point of cmovcc.

It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
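To make the "select, not branch" point concrete, here is a minimal C sketch of the same data-dependency idea. The names select_u64 and load_or_default are hypothetical, made up for illustration. The first function computes a select with plain ALU operations, so both inputs have to be ready before the result is. The second shows one way around the memory-source caveat: select between two addresses that are both valid, then do a single load. Whether a compiler actually emits cmov for either function is an assumption that depends on the compiler and flags.

```c
#include <stdint.h>

/* Branchless select: returns a when cond is nonzero, otherwise b.
   Both a and b are already computed before the choice is made,
   which mirrors cmov's data dependency on both inputs. */
static uint64_t select_u64(int cond, uint64_t a, uint64_t b)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0); /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}

/* "Load or default" without dereferencing a possibly-NULL pointer:
   a cmov with a memory source would load from p unconditionally, so
   pick the address first and load afterwards. &fallback is always
   valid to read, so the single load below is always safe. */
static uint64_t load_or_default(const uint64_t *p, uint64_t fallback)
{
    const uint64_t *src = p ? p : &fallback; /* may compile to cmov; not guaranteed */
    return *src;
}
```

The second function is the pattern to reach for when one of the two candidate addresses might be bad: the select happens on pointers, and only the chosen one is ever dereferenced.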

But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (Only difference from adc/sbb is that it doesn't write FLAGS. And of course runs on an execution unit with different internals).

Whether that's good or bad entirely depends on the use-case. See also "gcc optimization flag -O3 makes code slower than -O2" for much more about cmov upsides and downsides.

Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.

repne scasb runs about 1 count per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock saturating the load ports.

(e.g. see glibc's memchr for ORing pcmpeqb results for a whole cache line together to feed one pmovmskb, IIRC. Then go back and sort out where the actual hit was.)
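The ORing idea can be sketched in C with SSE2 intrinsics. This is an illustration of the technique only, not glibc's actual code: find_in_block64 is a hypothetical helper, it assumes the caller guarantees 64 readable bytes at p, and __builtin_ctzll is a GCC/Clang builtin.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Returns the index (0..63) of the first byte equal to c in the 64-byte
   block at p, or -1 if there is none. p must point at 64 readable bytes. */
static int find_in_block64(const unsigned char *p, unsigned char c)
{
    const __m128i needle = _mm_set1_epi8((char)c);
    __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 0)), needle);
    __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 16)), needle);
    __m128i e2 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 32)), needle);
    __m128i e3 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 48)), needle);

    /* Fast path: OR the four compare results and use one pmovmskb
       for the whole block. */
    __m128i any = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));
    if (_mm_movemask_epi8(any) == 0)
        return -1;

    /* Slow path, taken at most once per search: build a 64-bit mask
       from the individual results to locate the first hit. */
    unsigned long long m =
          (unsigned long long)(unsigned)_mm_movemask_epi8(e0)
        | (unsigned long long)(unsigned)_mm_movemask_epi8(e1) << 16
        | (unsigned long long)(unsigned)_mm_movemask_epi8(e2) << 32
        | (unsigned long long)(unsigned)_mm_movemask_epi8(e3) << 48;
    return __builtin_ctzll(m); /* m is nonzero here */
}
```

The point of the split is that the common no-match iteration costs four compares, three ORs and a single movemask, while the extra movemasks are only paid on the block that contains the hit.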

repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but is total garbage for performance with anything but very small buffers.

SSE2 is baseline for x86-64 and efficient unaligned loads + pmovmskb make it a no-brainer for memchr where you can check for length >= 16 to avoid crossing into an unmapped page.
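A minimal sketch of that approach in C with SSE2 intrinsics, under a few assumptions: memchr_sse2 is a hypothetical name chosen to avoid clashing with libc, __builtin_ctz is a GCC/Clang builtin, and the sub-16-byte tail is handled with a plain byte loop rather than anything clever. It is not unrolled and is not meant to match a tuned libc implementation.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

static void *memchr_sse2(const void *s, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    const __m128i needle = _mm_set1_epi8((char)c);

    /* Vector loop: only runs while at least 16 bytes remain, so the
       unaligned load never reads past the end of the buffer. */
    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);      /* pcmpeqb */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq); /* pmovmskb */
        if (mask != 0)                                   /* test + jnz */
            return (void *)(p + __builtin_ctz(mask));    /* lowest set bit = first match */
        p += 16;
        n -= 16;
    }

    /* Scalar tail for the last 0..15 bytes. */
    for (; n != 0; ++p, --n)
        if (*p == (unsigned char)c)
            return (void *)p;
    return NULL;
}
```

Usage is the same as memchr: memchr_sse2(buf, 'x', len) returns a pointer to the first 'x' in the first len bytes of buf, or NULL. Because every load stays inside [s, s + n), no page-crossing reasoning is needed in this version.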

Fast strlen:

"Why is this code 6.5x slower with optimizations enabled?" shows a simple not-unrolled strlen for 16-byte-aligned inputs using SSE2.
"Why does glibc's strlen need to be so complicated to run quickly?" links to some more stuff about hand-optimized asm strlen functions in glibc. (And how to make a bithack strlen in GNU C avoid strict-aliasing UB.)
https://codereview.stackexchange.com/a/213558 scalar bithack strlen, including the same 4-byte-at-a-time bithack that the glibc question was about. Better than byte-at-a-time but pointless with SSE2 (which x86-64 guarantees). However, @CodyGray's tutorial-style answer may be useful for beginners. Note that it doesn't take into account "Is it safe to read past the end of a buffer within the same page on x86 and x64?"
