Why don't C++ compilers do better constant folding?

Question

I'm investigating ways to speed up a large section of C++ code that has automatic derivatives for computing Jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the Jacobians.

This surprised me, since most of the Jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x that of the function, not 10-12x. To model what a large share of the Jacobian work looks like, I made a super minimal example with just a dot product (instead of the sin, cos, sqrt and more that a real situation would involve) that the compiler should be able to optimize to a single return value:

#include <Eigen/Core>
#include <Eigen/Geometry>

using Array12d = Eigen::Matrix<double,12,1>;

double testReturnFirstDot(const Array12d& b)
{
    Array12d a;
    a.array() = 0.;
    a(0) = 1.;
    return a.dot(b);
}

should be the same as

double testReturnFirst(const Array12d& b)
{
    return b(0);
}

I was disappointed to find that, without fast-math enabled, none of GCC 8.2, Clang 6, or MSVC 19 was able to make any optimizations at all over the naive dot product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy), the optimizations are very poor in GCC and Clang (they still involve multiplications and additions), and MSVC doesn't do any optimizations at all.

I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations, better constant propagation/folding would make more optimizations apparent, even if the constant fold itself didn't result in a speedup.

While I'm interested in explanations for why this isn't done on the compiler side, I'm also interested in what I can do on the practical side to make my own code faster when facing these kinds of patterns.

Recommended answer

This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd, and 1 horizontal reduction within the remaining 4-component register (this assumes AVX; with SSE only you'll get 6 mulpd and 5 addpd). With -ffast-math, GCC and Clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.
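As a side note on why -ffast-math matters here at all: under strict IEEE 754 semantics, 0.0 * x is not always +0.0, so a compiler that respects those semantics cannot simply drop multiplications by a constant zero. The sketch below is a plain C++ illustration of that point, assuming a build without -ffast-math; timesZero and foldsToPositiveZero are hypothetical helper names, not part of Eigen or of the original code.

```cpp
#include <cmath>
#include <limits>

// Multiplies x by a literal zero. Under strict IEEE 754 rules the compiler
// may not replace this with the constant 0.0, because the result depends on x.
double timesZero(double x)
{
    return 0.0 * x;
}

// Returns true only when 0.0 * x is exactly +0.0
// (value equal to zero and sign bit clear).
bool foldsToPositiveZero(double x)
{
    const double r = timesZero(x);
    return r == 0.0 && !std::signbit(r);
}
```

For a positive finite input such as 3.0, foldsToPositiveZero returns true. For a negative input the product is -0.0, and for an infinity or NaN input the product is NaN, so it returns false in those cases. Fast-math tells the compiler it may ignore these cases.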

So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH), but other pieces of code might become much slower.

If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internals, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:

static const int DontVectorize = 0x80000000;

namespace Eigen {
namespace internal {

template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
: traits<Matrix<_Scalar, _Rows, _Cols> >
{
  typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
  enum {
    EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
  };
};

}
}

using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;

Full example: https://godbolt.org/z/bOEyzv
