Is CMOVcc considered a branching instruction?

Question

I have this memchr code that I'm trying to make non-branching:

.globl memchr
memchr:
        mov %rdx, %rcx
        mov %sil, %al
        cld
        repne scasb
        lea -1(%rdi), %rax
        test %rcx, %rcx
        cmove %rcx, %rax
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

Solution

No, it's not a branch; that's the whole point of cmovcc.

It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
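To make the memory-source pitfall concrete, here is a minimal C sketch of one way to stay branchless with a maybe-NULL pointer: select between two valid addresses first, then load. The helper name `load_or_zero` and the dummy `zero` object are hypothetical, and whether a compiler actually emits a register-to-register cmov for the select is an assumption, not a guarantee.

```c
#include <stddef.h>

/* Hypothetical helper: returns *p, or 0 when p is NULL, without ever
   dereferencing NULL on any path. */
static int load_or_zero(const int *p)
{
    static const int zero = 0;

    /* Select an ADDRESS, not a loaded value. A compiler may turn this into
       test + cmov on registers, but it is free to use a branch instead. */
    const int *safe = p ? p : &zero;

    /* The load is unconditional, and always from a valid address. */
    return *safe;
}
```

The point of the design is that the unconditional load now happens after the select, so it never touches the possibly bad pointer. A cmov with `(%reg)` as its source would do the load first, which is exactly what you can't allow here.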

But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (Only difference from adc/sbb is that it doesn't write FLAGS. And of course runs on an execution unit with different internals).
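As a rough illustration of that dependency shape, a select written with plain mask arithmetic behaves the same way: the result can't be computed until the condition and both candidate values are ready, and nothing is predicted. The helper name `select_u64` is hypothetical; this is a sketch of the idea, not a claim about what any particular compiler emits.

```c
#include <stdint.h>

/* Hypothetical helper: returns a when cond is nonzero, otherwise b.
   Pure ALU work: the output depends on cond, a, and b as data. */
static uint64_t select_u64(int cond, uint64_t a, uint64_t b)
{
    /* All-ones when cond is nonzero, all-zeros otherwise. */
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0);

    return (a & mask) | (b & ~mask);
}
```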

Whether that's good or bad entirely depends on the use-case. See also "gcc optimization flag -O3 makes code slower than -O2" for much more about the upsides and downsides of cmov.


Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.

repne scasb runs about 1 count per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock saturating the load ports.
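A minimal sketch of that simple loop in C with SSE2 intrinsics might look like the following. The function name `memchr_sse2` is hypothetical, `__builtin_ctz` assumes GCC or Clang, and this is not glibc's implementation. It only does vector loads on full 16-byte chunks, so it never reads past the end of the buffer, and it finishes with a scalar tail.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical name; a sketch, not a tuned implementation. */
static void *memchr_sse2(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const __m128i needle = _mm_set1_epi8((char)c);  /* broadcast the byte */

    /* Full 16-byte chunks only: loads stay inside [s, s + n). */
    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        /* pcmpeqb then pmovmskb: one bit per matching byte. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)   /* the test + jnz part */
            return (void *)(p + __builtin_ctz((unsigned)mask));
        p += 16;
        n -= 16;
    }

    /* Scalar tail for the last 0..15 bytes. */
    for (; n != 0; p++, n--)
        if (*p == (unsigned char)c)
            return (void *)p;
    return NULL;
}
```

This still has branches (the loop and the hit check), but they predict well for anything but tiny buffers, and each iteration checks 16 bytes instead of 1.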

(e.g. see glibc's memchr for ORing pcmpeqb results for a whole cache line together to feed one pmovmskb, IIRC. Then go back and sort out where the actual hit was.)
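The OR-then-one-pmovmskb idea can be sketched like this. To be clear, this is not glibc's code: glibc works on aligned cache lines, while this version assumes unaligned loads and a plain length check to keep it short. The name `memchr_sse2_x4` is hypothetical and `__builtin_ctz` assumes GCC or Clang.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical name; illustrates the technique only. */
static void *memchr_sse2_x4(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const __m128i needle = _mm_set1_epi8((char)c);

    while (n >= 64) {
        /* Four pcmpeqb results covering 64 bytes. */
        __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 0)), needle);
        __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 16)), needle);
        __m128i e2 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 32)), needle);
        __m128i e3 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 48)), needle);

        /* OR them together so the common no-hit path needs one pmovmskb. */
        __m128i any = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));
        if (_mm_movemask_epi8(any) != 0) {
            /* Rare path: go back and sort out which vector had the first hit. */
            unsigned m0 = (unsigned)_mm_movemask_epi8(e0);
            unsigned m1 = (unsigned)_mm_movemask_epi8(e1);
            unsigned m2 = (unsigned)_mm_movemask_epi8(e2);
            unsigned m3 = (unsigned)_mm_movemask_epi8(e3);
            if (m0) return (void *)(p + __builtin_ctz(m0));
            if (m1) return (void *)(p + 16 + __builtin_ctz(m1));
            if (m2) return (void *)(p + 32 + __builtin_ctz(m2));
            return (void *)(p + 48 + __builtin_ctz(m3));
        }
        p += 64;
        n -= 64;
    }

    /* Scalar tail for the last 0..63 bytes, kept simple for the sketch. */
    for (; n != 0; p++, n--)
        if (*p == (unsigned char)c)
            return (void *)p;
    return NULL;
}
```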

repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but is total garbage for performance with anything but very small buffers.

SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 to avoid crossing into an unmapped page.

Fast strlen:
