Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

Problem description

I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:

__m256d d1 = _mm256_loadu_pd(vInOut + i*4);

I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and line (objdump -S -M intel -l avx.obj).
When I look into the underlying assembler code, I find the following:

vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1

I was expecting to see this:

vmovupd ymm0,XMMWORD PTR [rsi+rax*1]

and fully use the 256 bit register (ymm0), instead it looks like gcc has decided to fill in the 128 bit part (xmm0) and then load again the other half with vinsertf128.

Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.

I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.
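
For reference, here is a minimal self-contained reproducer of the codegen described above; the function and variable names are made up for illustration:

#include <immintrin.h>

// Compile with: gcc -O3 -mavx -g -c avx.c -o avx.obj
// then inspect with: objdump -S -M intel -l avx.obj
void scale4(double *vInOut, double factor, long n)
{
    const __m256d f = _mm256_set1_pd(factor);
    for (long i = 0; i < n; i++) {
        __m256d d1 = _mm256_loadu_pd(vInOut + i*4);            // possibly-unaligned 256-bit load
        _mm256_storeu_pd(vInOut + i*4, _mm256_mul_pd(d1, f));  // possibly-unaligned 256-bit store
    }
}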

Answer

GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.

Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).

Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.

According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)

But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs[1]. So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
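
If you can guarantee 32-byte alignment, tell the compiler, for example with the aligned-load intrinsic or with GCC's __builtin_assume_aligned. A minimal sketch (the function names here are invented for illustration):

#include <immintrin.h>

// Aligned 256-bit load: never split, regardless of tuning. p must be 32-byte aligned.
__m256d load_aligned(const double *p) {
    return _mm256_load_pd(p);
}

// For auto-vectorized code, __builtin_assume_aligned promises the alignment,
// so GCC can use full-width aligned accesses instead of split unaligned ones.
void add_one(double *a, long n) {
    double *pa = (double *)__builtin_assume_aligned(a, 32);
    for (long i = 0; i < n; i++)
        pa[i] += 1.0;
}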

Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver[1]), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and it makes sense if the data is genuinely likely to be misaligned at runtime, rather than merely not guaranteed to always be aligned at compile time. E.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.

It's common for buffers to be 16-byte aligned but not 32-byte aligned, because malloc on x86-64 glibc (and new in libstdc++) returns 16-byte aligned buffers (because alignof(max_align_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
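
A minimal sketch of getting a 32-byte-aligned buffer with C11 aligned_alloc (the size rounding is there because C11 requires the size to be a multiple of the alignment; the helper name is invented):

#include <stdlib.h>

// Allocate n doubles with 32-byte alignment. C11 aligned_alloc requires the
// size to be a multiple of the alignment, so round it up. Release with free().
double *alloc_doubles32(size_t n) {
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;   // round up to a multiple of 32
    return aligned_alloc(32, bytes);
}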

Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.

This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)

gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.

I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
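
A sketch of that kind of test code (the names are invented; the point is just to have both explicit intrinsics and an auto-vectorizable loop whose codegen depends on the tuning options):

#include <immintrin.h>

__m256 loadu_ps(const float *p)       { return _mm256_loadu_ps(p); }
void   storeu_ps(float *p, __m256 v)  { _mm256_storeu_ps(p, v); }

// Simple auto-vectorizable loop; the chosen vector width and whether
// unaligned accesses get split depend on the -march / -mtune options.
void add_arrays(float *restrict a, const float *restrict b, long n) {
    for (long i = 0; i < n; i++)
        a[i] += b[i];
}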

Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.

  • default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
  • -march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
  • -march=haswell and later: neither splitting option enabled.
  • -march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
  • -mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
  • -march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads. It also sets the gcc8 equivalent (-mprefer-vector-width=128) of the gcc7-and-earlier -mprefer-avx128, so auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors.
  • -march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator): same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
  • -march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
  • -march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
  • -march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end for this.

These split-unaligned options are enabled as part of -march=sandybridge, for example, and presumably also for Bulldozer-family (-march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.

Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores run at one per 17 to 20 clocks, according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.

Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.

But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining a big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.

Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
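
Expressed with intrinsics, the split load that GCC emits corresponds roughly to the following sketch (for illustration only; GCC does this internally, not from source written like this):

#include <immintrin.h>

// Roughly what -mavx256-split-unaligned-load does for one 256-bit double load:
__m256d split_loadu_pd(const double *p) {
    __m128d lo = _mm_loadu_pd(p);                 // vmovupd xmm0, [p]
    __m256d v  = _mm256_castpd128_pd256(lo);      // reinterpretation only, no instruction
    return _mm256_insertf128_pd(v, _mm_loadu_pd(p + 2), 1);  // vinsertf128 ymm0, ymm0, [p+16], 1
}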

Side note:

Most code isn't compiled by bleeding-edge compilers, so changing the "generic" tuning now will take a while before code compiled with an updated tuning gets into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native, so they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.)
