Is CMOVcc considered a branching instruction?
I have this memchr code that I'm trying to make non-branching:
    .globl memchr
memchr:
    mov %rdx, %rcx
    mov %sil, %al
    cld
    repne scasb
    lea -1(%rdi), %rax
    test %rcx, %rcx
    cmove %rcx, %rax
    ret
I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?
No, it's not a branch; that's the whole point of cmovcc.
It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned, it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS. And of course it runs on an execution unit with different internals.)
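To make the "ALU select" idea concrete, here is a minimal C sketch. The function names are hypothetical, and whether a compiler actually emits cmov for them is an assumption that depends on the compiler and its options. The second function illustrates the memory-source caveat: because a cmov with a memory operand always performs the load, the usual workaround is to select between two valid addresses first and load afterwards.

```c
#include <stddef.h>

/* Hypothetical helper: both inputs are already computed and one is picked.
   On x86-64, compilers often (not always) compile this to test + cmov. */
long select_long(int cond, long a, long b)
{
    return cond ? a : b;
}

/* Hypothetical helper: a cmov with a memory source would load *p even when
   p is NULL, so select between two valid addresses first, then do a single
   unconditional load from whichever address was chosen. */
long load_or_zero(const long *p)
{
    static const long zero = 0;
    const long *safe = p ? p : &zero;   /* address select; never dereferences NULL */
    return *safe;
}
```

For example, load_or_zero(NULL) yields 0 without dereferencing the null pointer, and select_long(0, 10, 20) yields 20.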
Whether that's good or bad depends entirely on the use-case. See also "gcc optimization flag -O3 makes code slower than -O2" for much more about the upsides and downsides of cmov.
Note that repne scasb is not fast. "Fast Strings" only works for rep stos / rep movs.
repne scasb runs about 1 count per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb / pmovmskb / test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock, saturating the load ports.
(e.g. see glibc's memchr for ORing pcmpeqb results for a whole cache line together to feed one pmovmskb, IIRC. Then go back and sort out where the actual hit was.)
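The ORing idea can be sketched with SSE2 intrinsics as below. This is not glibc's actual code, only an illustration of the technique under stated assumptions: find_in_block64 is a hypothetical name, the caller is assumed to guarantee that 64 bytes are readable at p, and __builtin_ctzll is a GCC/Clang builtin.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: index of the first byte equal to c in the 64 bytes
   at p, or -1 if there is none. The caller must guarantee 64 readable bytes. */
int find_in_block64(const unsigned char *p, unsigned char c)
{
    const __m128i needle = _mm_set1_epi8((char)c);
    const __m128i *v = (const __m128i *)p;

    /* Four pcmpeqb results, one per 16-byte vector. */
    __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 0), needle);
    __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 1), needle);
    __m128i e2 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 2), needle);
    __m128i e3 = _mm_cmpeq_epi8(_mm_loadu_si128(v + 3), needle);

    /* OR them together so the common no-match case needs only one
       pmovmskb and one branch per 64 bytes. */
    __m128i any = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));
    if (_mm_movemask_epi8(any) == 0)
        return -1;

    /* Rare path: go back and sort out where the actual hit was. */
    unsigned long long mask =
          (unsigned long long)_mm_movemask_epi8(e0)
        | (unsigned long long)_mm_movemask_epi8(e1) << 16
        | (unsigned long long)_mm_movemask_epi8(e2) << 32
        | (unsigned long long)_mm_movemask_epi8(e3) << 48;
    return __builtin_ctzll(mask);   /* lowest set bit = first matching byte */
}
```

A real implementation would typically also align the pointer so each 64-byte block is one cache line, and handle the head and tail of the buffer separately.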
repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but is total garbage for performance with anything but very small buffers.
SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 to avoid crossing into an unmapped page.
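A minimal sketch of such a loop in C with SSE2 intrinsics might look like the following. The name memchr_sse2_sketch is hypothetical, __builtin_ctz is a GCC/Clang builtin, and the scalar tail is just the simplest way to finish; it is a sketch under those assumptions, not a tuned implementation.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical name; same contract as memchr. */
void *memchr_sse2_sketch(const void *s, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    const __m128i needle = _mm_set1_epi8((char)c);

    /* Only load a vector while at least 16 bytes remain, so no load
       reads past the end of the buffer (or into an unmapped page). */
    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);  /* unaligned load */
        __m128i eq    = _mm_cmpeq_epi8(chunk, needle);         /* pcmpeqb */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);       /* pmovmskb */
        if (mask != 0)                                         /* test + jnz */
            return (void *)(p + __builtin_ctz(mask));          /* first match */
        p += 16;
        n -= 16;
    }

    /* Scalar tail for the remaining 0..15 bytes. */
    for (; n != 0; ++p, --n)
        if (*p == (unsigned char)c)
            return (void *)p;
    return NULL;
}
```

The loop still contains a branch per 16 bytes, but it is a well-predicted loop branch rather than one per byte.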
Fast strlen:
- Why is this code 6.5x slower with optimizations enabled? shows a simple not-unrolled strlen for 16-byte-aligned inputs using SSE2.
- Why does glibc's strlen need to be so complicated to run quickly? links to some more stuff about hand-optimized asm strlen functions in glibc. (And how to make a bithack strlen in GNU C avoid strict-aliasing UB.)
- https://codereview.stackexchange.com/a/213558 scalar bithack strlen, including the same 4-byte-at-a-time bithack that the glibc question was about. Better than byte-at-a-time but pointless with SSE2 (which x86-64 guarantees). However, @CodyGray's tutorial-style answer may be useful for beginners. Note that it doesn't take into account "Is it safe to read past the end of a buffer within the same page on x86 and x64?"