Does the compiler use SSE instructions for regular C code?

Question

I see people using -msse -msse2 -mfpmath=sse flags by default hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in the C code. But do these flags make any difference for regular C code? Does compiler use SSE to optimize regular C code?

Answer

Yes, modern compilers auto-vectorize with SSE2 if you compile with full optimization. clang enables it even at -O2, gcc at -O3.

Even at -O1 or -Os, compilers will use SIMD load/store instructions to copy or initialize structs or other objects wider than an integer register. That doesn't really count as auto-vectorization; it's more like part of their default builtin memset / memcpy strategy for small fixed-size blocks. But it does take advantage of and require SIMD instructions to be supported.

SSE2 is baseline / non-optional for x86-64, so compilers can always use SSE1/SSE2 instructions when targeting x86-64. Later instruction sets (SSE4, AVX, AVX2, AVX512, and non-SIMD extensions like BMI2, popcnt, etc.) have to be enabled manually to tell the compiler it's ok to make code that won't run on older CPUs. Or to get it to generate multiple versions of code and choose at runtime, but that has extra overhead and is only worth it for larger functions.

-msse -msse2 -mfpmath=sse is already the default for x86-64, but not for 32-bit i386. Some 32-bit calling conventions return FP values in x87 registers, so it can be inconvenient to use SSE/SSE2 for computation and then have to store/reload the result to get it in x87 st(0). With -mfpmath=sse, smarter compilers might still use x87 for a calculation that produces an FP return value.

On 32-bit x86, -msse2 might not be on by default; it depends on how your compiler was configured. If you're using 32-bit because you're targeting CPUs so old they can't run 64-bit code, you may want to make sure it's disabled, or enable only -msse.

The best way to make a binary tuned for the CPU you're compiling on is -O3 -march=native -mfpmath=sse, and use link-time optimization + profile-guided optimization. (gcc -fprofile-generate / run on some test data / gcc -fprofile-use).

Using -march=native makes binaries that might not run on earlier CPUs, if the compiler does choose to use new instructions. Profile-guided optimization is very helpful for gcc: it never unrolls loops without it. But with PGO, it knows which loops run often / for a lot of iterations, i.e. which loops are "hot" and worth spending more code-size on. Link-time optimization allows inlining / constant-propagation across files. It's very helpful if you have C++ with a lot of small functions that you don't actually define in header files.

See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output and making sense of it.

Here are some specific examples on the Godbolt compiler explorer for x86-64. Godbolt also has gcc for several other architectures, and with clang you can add -target mips or whatever, so you can also see auto-vectorization for ARM NEON with the right compiler options to enable it. You can use -m32 with the x86-64 compilers to get 32-bit code-gen.

int sumint(int *arr) {
    int sum = 0;
    for (int i=0 ; i<2048 ; i++){
        sum += arr[i];
    }
    return sum;
}

inner loop with gcc8.1 -O3 (without -march=haswell or anything to enable AVX/AVX2):

.L2:                                 # do {
    movdqu  xmm2, XMMWORD PTR [rdi]    # load 16 bytes
    add     rdi, 16
    paddd   xmm0, xmm2                 # packed add of 4 x 32-bit integers
    cmp     rax, rdi
    jne     .L2                      # } while(p != endp)

    # then horizontal add and extract a single 32-bit sum

Without -ffast-math, compilers can't reorder FP operations, so the float equivalent doesn't auto-vectorize (see the Godbolt link: you get scalar addss). (OpenMP can enable it on a per-loop basis, or you can use -ffast-math.)

But some FP stuff can safely auto-vectorize without changing order of operations.

// clang won't contract this into an FMA without -ffast-math :/
// but gcc will (if you compile with -march=haswell)
void scale_array(float *arr) {
    for (int i=0 ; i<2048 ; i++){
        arr[i] = arr[i] * 2.1f + 1.234f;
    }
}

  # load constants: xmm2 = {2.1,   2.1,   2.1,   2.1}
  #                 xmm1 = {1.234, 1.234, 1.234, 1.234}
.L9:   # gcc8.1 -O3                       # do {
    movups  xmm0, XMMWORD PTR [rdi]         # load unaligned packed floats
    add     rdi, 16
    mulps   xmm0, xmm2                      # multiply Packed Single-precision
    addps   xmm0, xmm1                      # add Packed Single-precision
    movups  XMMWORD PTR [rdi-16], xmm0      # store back to the array
    cmp     rax, rdi
    jne     .L9                           # }while(p != endp)

A multiplier of 2.0f results in addps being used to double the value instead of mulps, cutting throughput by a factor of 2 on Haswell / Broadwell! Before SKL, FP add ran on only one execution port, while there are two FMA units that can run multiplies. SKL dropped the dedicated adder and runs FP add with the same 2-per-clock throughput and latency as mul and FMA. (http://agner.org/optimize/, and see other performance links in the x86 tag wiki.)

Compiling with -march=haswell lets the compiler use a single FMA for the scale + add. (But clang won't contract the expression into an FMA unless you use -ffast-math; its -ffp-contract=fast option enables FP contraction without the other aggressive optimizations.)
