Do I need to use _mm256_zeroupper in 2021?

Question

From Agner Fog's "Optimizing software in C++":

There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when going from AVX code to non-AVX code because of a change in the YMM register state. This penalty should be avoided by calling the intrinsic function _mm256_zeroupper() before any transition from AVX code to non-AVX code. This can be necessary in the following cases:

• If part of a program is compiled with AVX support and another part of the program is compiled without AVX support then call _mm256_zeroupper() before leaving the AVX part.

• If a function is compiled in multiple versions with and without AVX using CPU dispatching then call _mm256_zeroupper() before leaving the AVX part.

• If a piece of code compiled with AVX support calls a function in a library other than the library that comes with the compiler, and the library has no AVX support, then call _mm256_zeroupper() before calling the library function.
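
Purely to illustrate the pattern those bullets describe (legacy_sse_filter is a hypothetical function from a library built without AVX support; remainder handling is omitted), a sketch of such a manual call might look like this:

    #include <immintrin.h>

    // Hypothetical function from a library compiled without AVX support.
    extern "C" void legacy_sse_filter(float* data, int n);

    // This translation unit is compiled with AVX enabled (e.g. -mavx).
    void process(float* data, int n) {
        __m256 scale = _mm256_set1_ps(2.0f);
        for (int i = 0; i + 8 <= n; i += 8) {        // remainder elements ignored for brevity
            __m256 v = _mm256_loadu_ps(data + i);
            _mm256_storeu_ps(data + i, _mm256_mul_ps(v, scale));
        }
        _mm256_zeroupper();          // the manual call the book recommends before leaving AVX code
        legacy_sse_filter(data, n);  // transition into non-AVX library code
    }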

I'm wondering which Intel processors those are. Specifically, are there any made in the last five years? That way I'll know whether it is too late to fix missing _mm256_zeroupper() calls or not.

Answer

TL:DR: Don't use the _mm256_zeroupper() intrinsic manually; compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)

"Some Intel processors" being all except Xeon Phi.

Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.

On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See "Why is this SSE code 6 times slower without VZEROUPPER on Skylake?" for the two different styles of SSE/AVX penalty.

Related Q&As about the effects of asm vzeroupper (in the compiler-generated machine code):

You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties. (Especially when considering inlining.) All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of using vzeroupper after that.
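
As a sketch of what that means in practice (hypothetical function, assuming something like gcc or clang with -mavx and optimization enabled): the source contains no _mm256_zeroupper() at all, and the compiler appends a vzeroupper before the ret on its own.

    #include <immintrin.h>

    // Sums floats in 8-wide chunks using 256-bit vectors.  No manual
    // _mm256_zeroupper() anywhere: when built with AVX enabled, the compiler
    // emits vzeroupper before returning, since the caller might be legacy-SSE code.
    float sum256(const float* p, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i + 8 <= n; i += 8)           // remainder omitted for brevity
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
        __m128 lo = _mm256_castps256_ps128(acc);      // horizontal sum via 128-bit ops
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
        return _mm_cvtss_f32(s);
    }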

The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256i/d args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state because a full YMM register is in-use as part of the calling convention.

x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish from "vectorcall") passes vectors by value in memory via hidden pointer. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers compiling Windows non-vectorcall calls handle this, whether they assume the function probably looks at its args or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.

Some compilers optimize by also omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, compilers apparently shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in clean-upper state, so the caller does still need a vzeroupper after calling such a function, before a call to printf or whatever.
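
A sketch of that situation (dot8 is a hypothetical function that takes __m256 args by value, so both sides of that call are necessarily AVX-aware):

    #include <immintrin.h>
    #include <cstdio>

    float dot8(__m256 a, __m256 b);   // hypothetical, takes vector args by value, defined elsewhere

    void report(const float* x, const float* y) {
        float d = dot8(_mm256_loadu_ps(x), _mm256_loadu_ps(y));
        // dot8 may return with dirty uppers (its compiler may have skipped
        // vzeroupper because dot8's caller is clearly AVX-aware), so the
        // compiler of *this* function inserts a vzeroupper here, before the
        // call to printf, which might be legacy-SSE code.
        std::printf("%f\n", d);
    }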

You can tell compilers not to insert vzeroupper automatically, for example with GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper, giving the behaviour described above: using vzeroupper where it might be needed.)

This is implied by -march=knl (Knight's Landing) because vzeroupper isn't needed and is very slow on Xeon Phi CPUs (thus should actively be avoided).

Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.

For MSVC, you should definitely prefer using /arch:AVX when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 without /arch:AVX. But beware that that option will make even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and will let the compiler auto-vectorize with AVX.
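
For instance, a minimal sketch (the exact asm depends on the MSVC version):

    #include <immintrin.h>

    // Built with MSVC /arch:AVX (or /arch:AVX2), this 128-bit intrinsic is
    // emitted with the VEX encoding (vaddps) rather than legacy-SSE addps, so
    // it can't cause an SSE/AVX transition penalty when mixed with 256-bit
    // code in the same binary.  Without /arch:AVX it would be legacy-SSE addps.
    __m128 add4(__m128 a, __m128 b) {
        return _mm_add_ps(a, b);
    }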
