Matrix multiplication via std::vector is 10 times slower than numpy

Problem description

Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now, since it is flexible and many existing functions can handle std::vector.

I thought that in small cases the speed difference could be ignored. But it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().

Let A and B be matrices of size size x size. Assuming square matrices is just for simplicity. (We don't intend to limit the discussion to the square-matrix case.) We initialize each matrix in a deterministic way, and finally calculate C = A * B.

We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.

Python 3 code

import numpy as np
import time
import sys

if len(sys.argv) != 2:
    print("Pass `size` as an argument.", file=sys.stderr)
    sys.exit(1)
size = int(sys.argv[1])

A = np.ndarray((size, size))
B = np.ndarray((size, size))

for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B)
print("{:.3e}".format(time.time() - start), file=sys.stderr)

C++ code

#include <iostream>
#include <vector>
#include <chrono>
using namespace std;

int main(int argc, char **argv) {

    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));

    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    auto start = chrono::system_clock::now();

    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            for (unsigned k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";

}

C++ code (multithreaded)

We also wrote a multithreaded version of the C++ code, since numpy.dot() is automatically calculated in parallel.

You can get all the code from GitHub.

Results

The C++ version is 10+ times slower than the Python 3 (with numpy) version.

matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
         C++ (1 thread): 8.66e-03
        C++ (2 threads): 4.68e-03
        C++ (3 threads): 3.14e-03
        C++ (4 threads): 2.43e-03
               Python 3: 4.07e-04
-----------------------------------------------

matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
         C++ (1 thread): 6.985e-02
        C++ (2 threads): 3.647e-02
        C++ (3 threads): 2.462e-02
        C++ (4 threads): 1.915e-02
               Python 3: 1.466e-03
-----------------------------------------------

Question

Is there any way to make the C++ implementation faster? Here is what we have tried so far:

  1. Swap the calculation order -> at most 3.5 times faster (than the C++ code, not than the numpy code).

  2. Optimization 1, plus partial loop unrolling -> at most 4.5 times faster. (We first thought this could be done only when size is known in advance, but as pointed out in a comment, that is not necessary: we can just limit the max value of the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.)

  3. Optimization 2, plus minimizing the calls to C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is un-ignorably slow.

  4. Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3 of course.)

  5. Optimization 3, plus reducing the calls to operator[] by introducing a pointer to an element of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a little bit faster than Optimization 4. The code is shown below.

  6. The g++ -funroll-loops flag to unroll for loops -> no change.

  7. g++ #pragma GCC unroll n -> no change.

  8. The g++ -flto flag to turn on link-time optimization -> no change.

  9. Block algorithm -> no change.

  10. Transpose B to avoid cache misses -> no change.

  11. A long linear std::vector instead of nested std::vector<std::vector<double>>, swapped calculation order, block algorithm, and partial unrolling -> at most 2.2 times faster.

  12. Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster.

  13. Optimization 3, plus PGO -> same as Optimization 3.

  14. Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3.


当前状态

(原本)13.06慢--(目前)2.10


Current Status

(originally) 13.06 times slower -> (currently) 2.10 times slower

Again, you can get all the code on GitHub. But let us cite some of it; each snippet below is a function called from the multithreaded version of the C++ code.

Original code (GitHub)

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned j = 0; j < j_max; ++j) {
            for (unsigned k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

Current best code (GitHub)

This is the implementation of Optimization 5 above.

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {

    static const unsigned num_unroll = 5;

    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();

    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (unsigned j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (unsigned k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (unsigned j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }

}

We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the cases we handle are not large, and each operation on std::vector is so expensive that, as we've seen, just reducing the calls to operator[] improved performance considerably.

We (want to) believe we can make it better, though.

Answer

Matrix multiplication is relatively easy to optimize. However, if you want to reach decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:

  1. Use SIMD instructions
  2. Use register blocking and fetch multiple data at once
  3. Optimize for your cache lines (mainly L2 and L3)
  4. Parallelize the code to use multiple threads

This link is a very good resource that explains all the nasty details: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

If you want more in-depth advice, leave a comment.
