Is Eigen slow at multiplying small matrices?

Problem description

I wrote a function that multiplies Eigen matrices of dimension 10x10 together. Then I wrote a naive multiply function, CustomMultiply, which was surprisingly 2x faster than Eigen's implementation.

I tried a couple of different compilation flags, such as -O2 and -O3, which made no difference.

  #include <Eigen/Core>
  #include <vector>

  constexpr int dimension = 10;
  using Matrix = Eigen::Matrix<double, dimension, dimension>;

  // Naive multiply: column j of the result is a linear combination of the
  // columns of a, weighted by the entries of column j of b (one axpy-style
  // column update per entry of b).
  Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
    Matrix result = Matrix::Zero();
    for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
      for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
        result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
      }
    }
    return result;
  }

  // NOTE: a std::vector of fixed-size vectorizable Eigen types normally needs
  // Eigen::aligned_allocator (as used in the updated benchmark below).
  Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const std::vector<Matrix>& input) {
    Matrix acc = Matrix::Zero();
    for (int i = 0; i < num_repetitions; ++i) {
      for (const auto& matrix_a : input) {
        for (const auto& matrix_b : input) {
          acc.noalias() += matrix_a * matrix_b;
        }
      }
    }
    return acc;
  }

  Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const std::vector<Matrix>& input) {
    Matrix acc = Matrix::Zero();
    for (int i = 0; i < num_repetitions; ++i) {
      for (const auto& matrix_a : input) {
        for (const auto& matrix_b : input) {
          acc.noalias() += CustomMultiply(matrix_a, matrix_b);
        }
      }
    }
    return acc;
  }

PairwiseMultiplyEachMatrixNoAlias is 2x slower than PairwiseMultiplyEachMatrixCustom on my machine when I pass in 100 random matrices as input and use 100 as num_repetitions. My machine details: Intel Xeon CPU E5-2630 v4, Ubuntu 16.04, Eigen 3.
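
For concreteness, the harness was presumably driven along these lines (an assumed sketch; the original timing code is not shown in the question):

  // Hypothetical driver matching the described setup: 100 random matrices,
  // 100 repetitions of the pairwise multiply.
  std::vector<Matrix> input;
  for (int i = 0; i < 100; ++i) {
    input.push_back(Matrix::Random());
  }
  const Matrix result = PairwiseMultiplyEachMatrixNoAlias(100, input);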

Update: after a helpful discussion in the comments, the results are unchanged with the following modifications:

  • num_repetitions = 1 and input.size() = 1000
  • using .lazyProduct() and using .eval() actually leads to a further slowdown (sketched below)
  • clang 8.0.0
  • g++ 9.2
  • using the flags -march=native -DNDEBUG
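
For reference, the .lazyProduct()/.eval() variants presumably looked something like the sketch below (assumed usage pieced together from the list above, not the exact code from the comments):

  // Variant 1: force Eigen's coefficient-based product instead of the
  // blocked gemm kernel.
  acc.noalias() += matrix_a.lazyProduct(matrix_b);
  // Variant 2: force the product to evaluate into a temporary first.
  acc.noalias() += (matrix_a * matrix_b).eval();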

Update 2:
Following up on @dtell's findings with the Google Benchmark library, I found an interesting result. Multiplying 2 matrices with Eigen is faster than the custom version, but multiplying many matrices with Eigen is 2x slower, in line with the previous findings.

Here is my Google Benchmark code. (Note: there was an off-by-one in the GetRandomMatrices() function below, which is now fixed.)

#include <cstdlib>
#include <vector>

#include <Eigen/Core>
#include <Eigen/StdVector>
#include <benchmark/benchmark.h>

constexpr int dimension = 10;
constexpr int num_random_matrices = 10;
using Matrix = Eigen::Matrix<double, dimension, dimension>;
using Eigen_std_vector = std::vector<Matrix,Eigen::aligned_allocator<Matrix>>;

Eigen_std_vector GetRandomMatrices(int num_matrices) {
  Eigen_std_vector matrices;
  for (int i = 0; i < num_matrices; ++i) {
    matrices.push_back(Matrix::Random());
  }
  return matrices;
}

Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
  Matrix result = Matrix::Zero();
  for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
    for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
      result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
    }
  }
  return result;
}

Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += matrix_a * matrix_b;
      }
    }
  }
  return acc;
}

Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += CustomMultiply(matrix_a, matrix_b);
      }
    }
  }
  return acc;
}

void BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixNoAlias(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixNoAlias);


void BM_PairwiseMultiplyEachMatrixCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixCustom(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixCustom);

void BM_MultiplySingle(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize((random_matrices[0] * random_matrices[1]).eval());
  }
}
BENCHMARK(BM_MultiplySingle);

void BM_MultiplySingleCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize(CustomMultiply(random_matrices[0], random_matrices[1]));
  }
}
BENCHMARK(BM_MultiplySingleCustom);



double TestCustom() {
  const Matrix a = Matrix::Random();
  const Matrix b = Matrix::Random();

  const Matrix c = a * b;
  const Matrix custom_c = CustomMultiply(a, b);

  const double err = (c - custom_c).squaredNorm();
  return err;
}

// Just sanity check the multiplication
void BM_TestCustom(benchmark::State& state) {
  if (TestCustom() > 1e-10) {
    exit(-1);
  }
}
BENCHMARK(BM_TestCustom);

BENCHMARK_MAIN();

This produces the following mysterious report:

Run on (20 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias      28283 ns      28285 ns      20250
BM_PairwiseMultiplyEachMatrixCustom       14442 ns      14443 ns      48488
BM_MultiplySingle                           791 ns        791 ns     876969
BM_MultiplySingleCustom                     874 ns        874 ns     802052
BM_TestCustom                                 0 ns          0 ns          0

My current hypothesis is that the slowdown is attributable to instruction cache misses. It's possible Eigen's matrix multiply function does bad things to the instruction cache.

VTune output for custom:

VTune output for Eigen:

Maybe someone with more VTune experience can tell me whether I am interpreting this result correctly. The DSB is the decoded instruction cache, and MITE has something to do with instruction-decoder bandwidth. The Eigen version shows most instructions missing the DSB (a 66% miss rate) and a marked increase in stalls due to MITE bandwidth.

Update 3: After getting reports that the single-multiply version of the custom code was faster, I reproduced that on my machine as well. This goes against @dtell's original findings on their machine.

CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias      34787 ns      34789 ns      16477
BM_PairwiseMultiplyEachMatrixCustom       17901 ns      17902 ns      37759
BM_MultiplySingle                           349 ns        349 ns    2054295
BM_MultiplySingleCustom                     178 ns        178 ns    4624183
BM_TestCustom                                 0 ns          0 ns          0

I wonder whether I had left out an optimization flag in my previous benchmark run. In any case, I think this confirms that Eigen incurs an overhead when multiplying small matrices. If anyone out there has a machine that does not use a uop cache, I would be interested to see whether the slowdown is less severe.

Answer

(gdb) bt
#0  0x00005555555679e3 in Eigen::internal::gemm_pack_rhs<double, long, Eigen::internal::const_blas_data_mapper<double, long, 0>, 4, 0, false, false>::operator()(double*, Eigen::internal::const_blas_data_mapper<double, long, 0> const&, long, long, long, long) ()
#1  0x0000555555566654 in Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking<double, double>&, Eigen::internal::GemmParallelInfo<long>*) ()
#2  0x0000555555565822 in BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State&) ()
#3  0x000055555556d571 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::Benchmark::Instance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
#4  0x000055555556b469 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#5  0x000055555556a450 in main ()

From the stack trace, Eigen's matrix multiplication is using a generic multiply routine and looping over a dynamic matrix size. For the custom implementation, clang aggressively vectorizes it and unrolls the loops, so there is much less branching.
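
For reference, the custom kernel is equivalent to the textbook triple loop below (an illustrative sketch, not code from the answer); because dimension is a compile-time constant of 10, clang can fully unroll and vectorize it:

// Scalar equivalent of CustomMultiply: column j of the result accumulates
// column k of a scaled by b(k, j). Fixed loop bounds let the compiler
// unroll and vectorize aggressively.
for (int j = 0; j < dimension; ++j)
  for (int k = 0; k < dimension; ++k)
    for (int i = 0; i < dimension; ++i)
      result(i, j) += a(i, k) * b(k, j);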

Maybe there is some flag/option for Eigen to generate code optimized for this particular size.

However, if the matrix size is bigger, the Eigen version will perform much better than the custom one.
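
That claim is easy to probe with the benchmark above (a sketch: rebuild with a larger compile-time size, or switch to dynamic-size Eigen::MatrixXd for genuinely large matrices, and compare the two pairwise benchmarks):

// Rebuild the same benchmark with a larger fixed size; beyond the
// small-size regime, Eigen's blocked gemm kernel should overtake the
// naive column-update loop.
constexpr int dimension = 100;
using Matrix = Eigen::Matrix<double, dimension, dimension>;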
