Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?


Problem description

I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:

__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
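
(For context, a minimal compilable sketch around that line; the loop, the arithmetic done on the vector, and the definition of vInOut are illustrative assumptions, not the asker's exact source:)

#include <immintrin.h>
#include <stddef.h>

/* Doubles each element of vInOut; vInOut holds 4*n doubles and may be
   unaligned, hence the loadu/storeu intrinsics. */
void scale(double *vInOut, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m256d d1 = _mm256_loadu_pd(vInOut + i*4);  /* possibly unaligned 256-bit load */
        d1 = _mm256_add_pd(d1, d1);                  /* arbitrary work on the vector */
        _mm256_storeu_pd(vInOut + i*4, d1);
    }
}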

I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and line (objdump -S -M intel -l avx.obj).
When I look into the underlying assembler code, I find the following:

vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1

I expected to see this:

vmovupd ymm0,YMMWORD PTR [rsi+rax*1]

and fully use the 256-bit register (ymm0); instead, it looks like gcc has decided to fill in the 128-bit part (xmm0) and then load the other half again with vinsertf128.

Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.

I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.

Recommended answer

GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.

Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
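
For example (assuming the source file is named avx.c; these invocations are only illustrative):

gcc -O3 -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store avx.c
gcc -O3 -mavx -mtune=haswell avx.c
gcc -O3 -march=native avx.c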

Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.

According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)

But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs¹. So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
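
A sketch of what "telling the compiler" can look like: if you can guarantee 32-byte alignment, the aligned-load intrinsic (or GCC's __builtin_assume_aligned for auto-vectorized loops) removes any reason to split. The function names here are made up for illustration:

#include <immintrin.h>
#include <stddef.h>

/* Caller guarantees p is 32-byte aligned: the aligned-load intrinsic
   encodes that guarantee, so no split is ever emitted (and it faults
   if p is actually misaligned). */
__m256d load_vec(const double *p)
{
    return _mm256_load_pd(p);
}

/* For auto-vectorized code, GCC can be told about alignment directly. */
double sum(const double *p, size_t n)
{
    const double *pa = __builtin_assume_aligned(p, 32);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += pa[i];
    return s;
}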

Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except for stores on Piledriver¹), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and makes sense if it's really likely that the data is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.

It's common for buffers to be 16-byte aligned but not 32-byte aligned because malloc on x86-64 glibc (and new in libstdc++) returns 16-byte aligned buffers (because alignof(maxalign_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
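
A minimal sketch of allocating 32-byte-aligned storage with C11 aligned_alloc (the helper name is illustrative; note that C11 requires the size to be a multiple of the alignment):

#include <stdlib.h>

double *alloc_doubles_aligned32(size_t n)
{
    /* Round the byte count up to a multiple of 32, as C11 requires. */
    size_t bytes = (n * sizeof(double) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);   /* release with free() as usual */
}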

Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.

This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)
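
For instance (illustrative invocation, same assumed avx.c source file as above):

gcc -O3 -march=sandybridge -mtune=haswell avx.c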

gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.

I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
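
The test functions were along these lines (a reconstruction for illustration; the answer doesn't reproduce them verbatim):

#include <immintrin.h>
#include <stddef.h>

__m256 load(const float *p)    { return _mm256_loadu_ps(p); }
void store(float *p, __m256 v) { _mm256_storeu_ps(p, v); }

/* simple float loop that gcc can auto-vectorize */
void add(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}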

Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.
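
For example, to keep Steamroller instruction selection but still get 256-bit auto-vectorization (illustrative invocation; -mprefer-vector-width=256 needs gcc8):

gcc -O3 -march=bdver3 -mprefer-vector-width=256 avx.c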

  • default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
  • -march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
  • -march=haswell and later: neither splitting option enabled.
  • -march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
  • -mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
  • -march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads. It also sets the gcc8 equivalent of the gcc7-and-earlier -mprefer-avx128 (auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors).
  • -march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator): same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
  • -march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
  • -march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
  • -march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end for this.

These split options are enabled as part of -march=sandybridge, for example, and presumably also for Bulldozer-family (-march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.

Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores run at one per 17 to 20 clocks according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.

Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.

But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.

Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
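
For reference, the split sequence gcc emits corresponds to the following hand-written intrinsics (an equivalent reconstruction, not code from the answer):

#include <immintrin.h>

/* What -mavx256-split-unaligned-load turns _mm256_loadu_pd into:
   a 128-bit vmovupd into the low half, then vinsertf128 from memory
   for the high half. */
__m256d split_loadu_pd(const double *p)
{
    __m256d lo = _mm256_castpd128_pd256(_mm_loadu_pd(p));
    return _mm256_insertf128_pd(lo, _mm_loadu_pd(p + 2), 1);
}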

PS:

Most code isn't compiled by bleeding edge compilers, so changing the "generic" tuning now will take a while before code compiled with an updated tuning will get into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native, so they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.)
