GCC optimization flags for matrix/vector operations


Question

I am performing matrix operations in C. I would like to know which compiler optimization flags can improve the execution speed of these operations (multiplication, inverse, and so on) for double and int64 data. I am not looking for hand-optimized code; I just want to make the native code faster using compiler flags and to learn more about these flags.

The flags I have found so far that improve matrix code:

-O3/O4
-funroll-loops
-ffast-math

Recommended answer

First of all, I don't recommend using -ffast-math, for the following reasons:

  1. It has been reported that performance can actually degrade when this option is used, so "fast math" is not necessarily that fast.

  2. This option breaks strict IEEE compliance for floating-point operations, which ultimately leads to an accumulation of computational errors of an unpredictable nature.

  3. You may well get different results in different environments, and the difference may be substantial. The term environment (in this case) means the combination of hardware, OS, and compiler, so the number of situations in which you can get unexpected results grows exponentially.

  4. Another sad consequence is that programs which link against a library built with this option might expect correct (IEEE-compliant) floating-point math, and this is where their expectations break, but it will be very tough to figure out why.

Finally, have a look at this article.
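To make the IEEE-compliance point concrete, here is a minimal sketch of why reordering matters: floating-point addition is not associative, and -ffast-math permits GCC to reassociate such expressions. The function names sum_left and sum_right are hypothetical, and the volatile temporaries are only there to pin the evaluation order for the demonstration.

```c
#include <stdio.h>

/* Evaluates (a + b) + c. The volatile temporary pins this exact order. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;
    return t + c;
}

/* Evaluates a + (b + c): mathematically the same sum, different rounding. */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;
    return a + t;
}

/* Prints both results for inputs where the order of evaluation matters. */
void print_reassociation_demo(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %.1f\n", sum_left(a, b, c));
    printf("a + (b + c) = %.1f\n", sum_right(a, b, c));
}
```

With a = 1e16, b = -1e16, c = 1.0, the first form keeps the 1.0 while the second loses it, because 1.0 is smaller than the spacing between adjacent doubles near 1e16. Under strict IEEE semantics the compiler must preserve the order you wrote; with -ffast-math it may choose either.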

For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). An extract from the GCC documentation:

-Ofast

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.

There is no such flag as -O4. At least I'm not aware of one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3, and you should definitely be using it, not only to optimize math but in release builds in general.

-funroll-loops is a very good choice for math routines, especially those involving vector/matrix operations where the loop's trip count can be deduced at compile time (and the loop can therefore be unrolled by the compiler).
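As an illustration of a loop whose size is known at compile time, consider the sketch below. The dimension N and the function name matmul_fixed are assumptions made for this example, not something from the question. Because every loop bound is the constant N, the compiler can see the full trip count when building with something like gcc -O3 -funroll-loops.

```c
#include <stddef.h>

#define N 4  /* hypothetical fixed dimension, known at compile time */

/* c = a * b for N x N matrices stored row-major in flat arrays.
 * All three loops have constant bounds, so the compiler knows the
 * exact trip counts and is free to unroll them. */
void matmul_fixed(const double *a, const double *b, double *c) {
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            double acc = 0.0;
            for (size_t k = 0; k < N; ++k) {
                acc += a[i * N + k] * b[k * N + j];
            }
            c[i * N + j] = acc;
        }
    }
}
```

Whether unrolling actually helps depends on the target and the surrounding code, so it is worth measuring with and without the flag.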

I can recommend two more flags: -march=native and -mfpmath=sse. Like -O3, -march=native is good in general for release builds of any software, not only math-intensive code. -mfpmath=sse makes floating-point arithmetic use SSE instructions and the XMM registers (instead of the register stack in x87 mode).
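One way to see what these flags change is to inspect the macros GCC predefines for the selected target. The sketch below assumes an x86 target and uses hypothetical helper names (simd_level, fast_math_enabled, print_build_info). The macros themselves (__SSE2__, __AVX__, __AVX2__, __AVX512F__, __FAST_MATH__) are ones GCC defines when the corresponding feature is enabled.

```c
#include <stdio.h>

/* Returns a label for the widest x86 SIMD extension enabled at compile time. */
const char *simd_level(void) {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}

/* Returns 1 if the file was compiled with -ffast-math, otherwise 0. */
int fast_math_enabled(void) {
#ifdef __FAST_MATH__
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the build configuration. */
void print_build_info(void) {
    printf("SIMD level: %s, fast-math: %d\n", simd_level(), fast_math_enabled());
}
```

Building the same file with and without -march=native and comparing the output of print_build_info() shows which instruction sets the compiler was allowed to use. Note that on x86-64 targets -mfpmath=sse is already GCC's default, so that flag mainly matters for 32-bit x86 builds.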

Furthermore, I'd like to say that it's a pity you don't want to modify your code to get better performance, as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE intrinsics, and vectorization, heavy linear-algebra code can be orders of magnitude faster than without them. However, applying these techniques properly requires in-depth knowledge of their internals and quite some time and effort to modify (actually, rewrite) the code.
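For a sense of what such a rewrite looks like, here is a minimal sketch of element-wise vector addition using SSE2 intrinsics, processing two doubles per instruction. The function name add_vectors is hypothetical, and the code falls back to a plain scalar loop when SSE2 is not available, so it is an illustration of the idea rather than a tuned routine.

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#endif

/* out[i] = x[i] + y[i] for n doubles. */
void add_vectors(const double *x, const double *y, double *out, size_t n) {
    size_t i = 0;
#ifdef __SSE2__
    /* Main loop: two doubles per iteration, unaligned loads and stores. */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(out + i, _mm_add_pd(vx, vy));
    }
#endif
    /* Scalar tail (or the whole loop when SSE2 is unavailable). */
    for (; i < n; ++i) {
        out[i] = x[i] + y[i];
    }
}
```

Even this trivial case needs a separate tail loop and explicit load/store handling, which hints at the effort involved for a full matrix multiplication or inversion.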

Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization, which can be enabled with -ftree-vectorize, but this is unnecessary since you are using -O3 (which already includes -ftree-vectorize). The point is that you should still help GCC a little to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to familiarize yourself with them. So see the Vectorizable Loops section in the link above.
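A typical example of such a minor modification is telling the compiler that arrays do not overlap. The sketch below is an assumption about what "helping GCC" can look like, not an excerpt from the linked page: the hypothetical function daxpy_like uses the C99 restrict qualifier so the compiler does not have to assume that writing to y could change x.

```c
#include <stddef.h>

/* y[i] += alpha * x[i] for n elements.
 * restrict is a promise by the caller that x and y do not overlap,
 * which removes one common obstacle to auto-vectorization. */
void daxpy_like(size_t n, double alpha,
                const double *restrict x, double *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

To check what the vectorizer did, GCC can report vectorized loops when compiling with -fopt-info-vec, for example gcc -O3 -march=native -fopt-info-vec file.c. The promise made by restrict is not checked, so calling this function with overlapping arrays is undefined behavior.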

Finally, I recommend that you look into Eigen, a C++ template-based library that has highly efficient implementations of the most common linear algebra routines. It uses all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasant to use. The object-oriented approach suits linear algebra well, since it usually manipulates pure objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with such low-level concepts (SSE, vectorization, etc.) yourself, and can just enjoy solving your specific problem.
