Matrix multiplication via std::vector is 10 times slower than numpy

Problem description

Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now, since it is flexible and many existing functions can handle std::vector.

I thought that in small cases the speed difference could be ignored. But it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().

Let A and B be matrices of size size x size. Assuming square matrices is just for simplicity. (We don't intend to limit the discussion to the square-matrix case.) We initialize each matrix in a deterministic way, and finally calculate C = A * B.

We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.

Python 3 code

import numpy as np
import time
import sys

if len(sys.argv) != 2:
    print("Pass `size` as an argument.", file=sys.stderr)
    sys.exit(1)
size = int(sys.argv[1])

A = np.ndarray((size, size))
B = np.ndarray((size, size))

for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B)
print("{:.3e}".format(time.time() - start), file=sys.stderr)

C++ code

#include <iostream>
#include <vector>
#include <chrono>
using namespace std;

int main(int argc, char **argv) {

    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));

    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    auto start = chrono::system_clock::now();

    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            for (unsigned k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";

}

C++ code (multithreaded)

We also wrote a multithreaded version of the C++ code, since numpy.dot() is automatically calculated in parallel.

You can get all the code from GitHub.

Results

The C++ version is 10+ times slower than the Python 3 (with numpy) version.

matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
         C++ (1 thread): 8.66e-03
        C++ (2 threads): 4.68e-03
        C++ (3 threads): 3.14e-03
        C++ (4 threads): 2.43e-03
               Python 3: 4.07e-04
-----------------------------------------------

matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
         C++ (1 thread): 6.985e-02
        C++ (2 threads): 3.647e-02
        C++ (3 threads): 2.462e-02
        C++ (4 threads): 1.915e-02
               Python 3: 1.466e-03
-----------------------------------------------

Question

Is there any way to make the C++ implementation faster? Here is what we have tried so far:

  1. Swap the calculation order -> at most 3.5 times faster (than the C++ code, not than the numpy code).

  2. Optimization 1, plus partial loop unrolling -> at most 4.5 times faster. (We first thought this could be done only when size is known in advance, but as pointed out in a comment, that is not necessary: we can just limit the max value of the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.)

  3. Optimization 2, plus minimizing the calls to C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is un-ignorably slow.

  4. Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3 of course.)

  5. Optimization 3, plus reducing the calls to operator[] by introducing a pointer to an element of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a little bit faster than Optimization 4. The code is shown below.

  6. The g++ -funroll-loops flag to unroll for loops -> no change.

  7. g++ #pragma GCC unroll n -> no change.

  8. The g++ -flto flag to turn on link-time optimization -> no change.

  9. Block algorithm -> no change.

  10. Transpose B to avoid cache misses -> no change.

  11. A long linear std::vector instead of nested std::vector<std::vector<double>>, swapped calculation order, block algorithm, and partial unrolling -> at most 2.2 times faster.

  12. Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster.

  13. Optimization 3, plus PGO -> same as Optimization 3.

  14. Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3.


当前状态

(原本)13.06慢--(目前)2.10


Current Status

(originally) 13.06 times slower -> (currently) 2.10 times slower

Again, you can get all the code on GitHub. But let us cite some of it; each snippet below is a function called from the multithreaded version of the C++ code.

Original code (GitHub)

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned j = 0; j < j_max; ++j) {
            for (unsigned k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

Current best code (GitHub)

This is the implementation of Optimization 5 above.

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {

    static const unsigned num_unroll = 5;

    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();

    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (unsigned j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (unsigned k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (unsigned j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }

}

We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the cases we handle are not large, and each operation on std::vector is so expensive that, as we've seen, just reducing the calls to operator[] improved performance considerably.

We (want to) believe we can make it better, though.

Answer

Matrix multiplication is relatively easy to optimize. However, if you want to reach decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:

  1. Use SIMD instructions
  2. Use register blocking and fetch multiple data at once
  3. Optimize for your cache lines (mainly L2 and L3)
  4. Parallelize the code to use multiple threads

This link is a very good resource that explains all the nasty details: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

If you want more in-depth advice, leave a comment.
