pragma omp for simd does not generate vector instructions in GCC


Question


Short: Does the pragma omp for simd OpenMP directive generate code that uses SIMD registers?

Longer: As stated in the OpenMP documentation, "The worksharing-loop SIMD construct specifies that the iterations of one or more associated loops will be distributed across threads that already exist [..] using SIMD instructions". From this statement, I would expect the following code (simd.c) to use XMM, YMM or ZMM registers when compiled with gcc simd.c -o simd -fopenmp, but it does not.

#include <stdio.h>
#define N 100

int main() {
    int x[N];
    int y[N];
    int z[N];
    int i;
    int sum = 0;    /* must be initialized: reduction(+:sum) accumulates into it */

    for(i=0; i < N; i++) {
        x[i] = i;
        y[i] = i;
    }

    #pragma omp parallel
    {
        #pragma omp for simd
        for(i=0; i < N; i++) {
            z[i] = x[i] + y[i];
        }
        #pragma omp for simd reduction(+:sum)
        for(i=0; i < N; i++) {
            sum += x[i];
        }
    }
    printf("%d %d\n",z[N/2], sum);

    return 0;
}

Checking the assembly generated by running gcc simd.c -S -fopenmp shows that no SIMD registers are used.

I can use SIMD registers without OpenMP by using the option -O3 because, according to the GCC documentation, it includes the -ftree-vectorize flag:

  • XMM registers: gcc simd.c -o simd -O3
  • YMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512
  • ZMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512 -mprefer-vector-width=512

However, using the flags -march=skylake-avx512 -mprefer-vector-width=512 combined with -fopenmp does not generate SIMD instructions.

Therefore, I can easily vectorize my code with -O3 and without pragma omp for simd, but not the other way around.

At this point, my purpose is not to generate SIMD instructions but to understand how OpenMP SIMD directives work in GCC and how to generate SIMD instructions with OpenMP alone (without -O3).

Solution

Enable at least -O2 for -fopenmp to work, and for performance in general

gcc simd.c -S -fopenmp

GCC's default is -O0, anti-optimized for consistent debugging. It's never going to auto-vectorize at -O0 because vectorization is pointless when every value of i from the C source has to exist in memory, and so on. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

Auto-vectorization is also impossible when you have to be able to single-step source lines one at a time, or even modify i or memory contents at runtime with the debugger, and have the program keep running the way you'd expect the C abstract machine to.

Building without any optimization is utter garbage for performance; it's insane to even consider if you care about performance enough to be using OpenMP. (Except of course for actual debugging.) Often the speedup from anti-optimized to optimized scalar is more than what you could gain from vectorizing that scalar code, but both can be large factors so you definitely want optimizations beyond auto-vectorization.


I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation it includes the -ftree-vectorize flag.

Right, so do that. -O3 -march=native -flto is usually your best bet for code that will run on the compile host. Also -fno-trapping-math -fno-math-errno should be safe for everything and enable some better FP function inlining, even if you don't want -ffast-math. Also preferably -fprofile-generate / -fprofile-use profile-guided optimization (PGO), to unroll hot loops and choose branchy vs. branchless appropriately, etc.
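For instance, a minimal PGO workflow for this program could look like the following; the flags are standard GCC options named above, but the three-step recipe is just a sketch, and the training run should use a representative input:

gcc simd.c -o simd -O3 -march=native -flto -fopenmp -fprofile-generate
./simd                 # training run: writes *.gcda profile data
gcc simd.c -o simd -O3 -march=native -flto -fopenmp -fprofile-use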

#pragma omp parallel is still effective at -O3 -fopenmp - GCC doesn't enable autoparallelization by default.
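(If you did want the compiler to thread loops by itself, that's a separate opt-in pass, -ftree-parallelize-loops=n, unrelated to the OpenMP pragmas.)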

Also, #pragma omp simd will sometimes use a different vectorization style. In your case, it seems to make GCC forget that it knows the arrays are 16-byte aligned, and use movdqu loads (without AVX, an unaligned memory source operand isn't possible for paddd xmm0, [rax]). Compare https://godbolt.org/z/8q8Dqm - the main._omp_fn.0: helper function that main calls doesn't assume alignment. (Although maybe it can't, if dividing the iterations among threads splits the array into ranges that GCC doesn't bother to round to vector-sized chunks?)
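If you want to hand that alignment guarantee back to the compiler, OpenMP's aligned clause can assert it explicitly. A minimal sketch (hypothetical function, not from the question; whether GCC then emits aligned loads still depends on version and cost model):

void add_aligned(int *x, int *y, int *z, int n) {
    #pragma omp parallel
    {
        /* assert 16-byte alignment so SSE can use movdqa / an aligned
           memory operand for paddd instead of movdqu loads */
        #pragma omp for simd aligned(x, y, z : 16)
        for (int i = 0; i < n; i++)
            z[i] = x[i] + y[i];
    }
}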


Use -O2 -fopenmp to get what you were expecting

OpenMP will let gcc vectorize more easily, or more efficiently, in loops where you didn't use restrict on pointer args to functions to let it know that arrays don't overlap, or, for floating point, by letting it pretend that FP math is associative even though you didn't use -ffast-math.
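Both effects are easy to see in isolation. Here is a hypothetical pair of functions (not from the question): the first vectorizes under plain -O3 because restrict rules out overlap, the second uses an OpenMP reduction so GCC may build vector partial sums of floats without -ffast-math:

/* restrict promises the compiler the arrays don't overlap, so plain -O3
   can vectorize without runtime overlap checks */
void add_restrict(const int *restrict x, const int *restrict y,
                  int *restrict z, int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* the simd reduction gives gcc explicit permission to reassociate the FP
   additions into vector partial sums, without needing -ffast-math */
float sum_floats(const float *x, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}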

Or if you enable some optimization but not full optimization (e.g. -O2, which doesn't include -ftree-vectorize until GCC 12), then #pragma omp will work the way you expected.

Note that the init loop (x[i] = i; y[i] = i;) doesn't get auto-vectorized at -O2, but the #pragma loops do. And without -fopenmp, everything stays pure scalar. (Godbolt compiler explorer)


The serial -O3 code will run faster for this small N because thread-startup overhead is nowhere near worth it. But for large N, parallelization could help if a single core can't saturate memory bandwidth (e.g. on a Xeon, but most dual/quad-core desktop CPUs can almost saturate mem bandwidth with one core). Or if your arrays are hot in cache on different cores.
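One way to express that tradeoff in code, sketched here with an arbitrary, untuned threshold: OpenMP's if() clause makes the fork conditional at runtime, so small inputs stay serial.

void add_maybe_parallel(const int *x, const int *y, int *z, int n) {
    #pragma omp parallel if(n > 65536)   /* team of 1 thread for small n */
    {
        #pragma omp for simd
        for (int i = 0; i < n; i++)
            z[i] = x[i] + y[i];
    }
}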

Unfortunately(?) even GCC -O3 doesn't manage to do constant-propagation through your whole code and just print the result, or to fuse the z[i] = x[i]+y[i] loop with the sum(x[]) loop.
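Done by hand with the question's variables, that fusion would look something like this sketch, making a single pass over x[]:

int sum = 0;
for (int i = 0; i < N; i++) {
    z[i] = x[i] + y[i];   /* x[i] is loaded once and feeds both results */
    sum += x[i];
}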
