Does any of current C++ compilers ever emit "rep movsb/w/d"?


Problem description

This question made me wonder whether current modern compilers ever emit the REP MOVSB/W/D instructions.

Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.

But no matter what I tried, I could not make any of the current compilers (GCC 8, Clang 7, MSVC 2017, and ICC 18) emit this instruction.

For this simple code, it could be reasonable to emit REP MOVSB:

void fn(char *dst, const char *src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}

But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?

Solution

GCC has x86 tuning options to control string-ops strategy and when to inline vs. library call. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte.

I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization)).

Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
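
To make that concrete, here is a small self-contained demo (my own illustration, not part of the original answer; the helper name byte_copy is made up) showing the period dst-src repeating pattern a forward byte copy produces, versus what memmove gives:

#include <stdio.h>
#include <string.h>

/* Same semantics as the question's loop (and as rep movsb): a forward byte copy. */
static void byte_copy(char *dst, const char *src, int l) {
    for (int i = 0; i < l; i++)
        dst[i] = src[i];
}

int main(void) {
    char a[16] = "ABCDEFGH";
    char b[16] = "ABCDEFGH";

    byte_copy(a + 2, a, 6);   /* overlapping copy with dst = src + 2 */
    memmove(b + 2, b, 6);     /* copies as if through a temporary buffer */

    printf("byte copy: %s\n", a);   /* ABABABAB: repeating pattern, period 2 */
    printf("memmove:   %s\n", b);   /* ABABCDEF: the original 6 bytes, shifted */
    return 0;
}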

gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)

void fn(char *__restrict dst, const char *__restrict src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}

This probably shouldn't "count" because I would probably not recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test because nobody using -march=native would get this code-gen.

But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:

fn:
        test    edx, edx
        jle     .L1               # rep movs treats the counter as unsigned, but the source uses signed
        sub     edx, 1            # what the heck, gcc?  mov ecx,edx would be too easy?
        lea     ecx, [rdx+1]

        rep movsb                 # dst=rdi and src=rsi
.L1:                              # matching the calling convention
        ret

Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.

rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment, I didn't look carefully.

I also didn't check other compilers.


Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.

(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though. e.g. alignas(64) on some arrays.)
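
For reference, a variant with the alignment promise spelled out might look like the sketch below (a hypothetical example of mine, compiled as C so the void* return of __builtin_assume_aligned converts implicitly; as noted above, it probably wouldn't change gcc's strategy):

void fn_aligned(char *__restrict dst, const char *__restrict src, int l) {
    /* Promise 64-byte alignment to the compiler; whether gcc exploits it here is doubtful. */
    dst = __builtin_assume_aligned(dst, 64);
    src = __builtin_assume_aligned(src, 64);
    for (int i = 0; i < l; i++)
        dst[i] = src[i];
}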

Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)


memmove / memcpy strategies:

BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.

Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
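
As a rough C sketch of that overlapping load/store idea (my illustration, not glibc's actual code; copy_4_to_7 is a made-up name), the 4..7-byte case could look like this:

#include <stdint.h>
#include <string.h>

/* Copy n bytes where 4 <= n <= 7: load the first and last 4 bytes, then store both.
   The two stores overlap when n < 8, so every size in 4..7 takes the same path.
   Both loads happen before either store, so overlapping src/dst is also handled. */
static void copy_4_to_7(char *dst, const char *src, size_t n) {
    uint32_t head, tail;
    memcpy(&head, src, 4);
    memcpy(&tail, src + n - 4, 4);
    memcpy(dst, &head, 4);
    memcpy(dst + n - 4, &tail, 4);
}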

For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.


rep movsb might well be a pretty reasonable choice for small code-size and non-terrible scaling with count for a byte loop like this, with safe handling for the unlikely case of overlap.
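
For comparison, using the instruction directly from C takes only a few lines of GNU inline asm (a sketch of mine, not something the answer proposes verbatim; it relies on the ABI's guarantee that the direction flag is clear):

#include <stddef.h>

/* rep movsb copies rcx bytes from [rsi] to [rdi], updating all three registers. */
static void copy_rep_movsb(char *dst, const char *src, size_t n) {
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)  /* rdi, rsi, rcx: read and written */
                 :
                 : "memory");
}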

Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.

It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.

Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)

It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...).
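
If a runtime dispatch were wanted instead, the ERMSB flag itself is easy to query (a hedged sketch of mine using GCC's <cpuid.h>; CPUID leaf 7, subleaf 0, EBX bit 9 is the documented ERMSB feature bit):

#include <cpuid.h>
#include <stdio.h>

static int has_ermsb(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 7 not supported */
    return (ebx >> 9) & 1;        /* EBX bit 9 = Enhanced REP MOVSB/STOSB */
}

int main(void) {
    printf("ERMSB: %s\n", has_ermsb() ? "yes" : "no");
    return 0;
}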

On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
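
A minimal sketch of that classic strategy (my simplification: the leftover n % 8 bytes are finished with rep movsb rather than the special-cased moves the old recommendation implies):

#include <stddef.h>

static void copy_rep_movsq(char *dst, const char *src, size_t n) {
    size_t qwords = n / 8, tail = n % 8;
    /* Copy 8 bytes at a time; rdi/rsi are left pointing just past the copied region. */
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(qwords)
                 :
                 : "memory");
    /* Mop up the last 0..7 bytes. */
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(tail)
                 :
                 : "memory");
}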

The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.


The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).

Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB locality, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.

Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)

Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.

IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible so ideally PGO would get it to special case smaller counts. (But I'm not too optimistic.)

I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.

A lot of software gets compiled with only -O2, so a peephole for rep movs other than the auto-vectorizer could make a difference. (But the question is whether it's a positive or negative difference!)
