Performance: Matlab vs C++ Matrix vector multiplication


Problem description

Preamble

Some time ago I asked a question about the performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python, especially in meshgrid. In the discussion of that question, it was pointed out to me that I should use a wrapper in Python to call my C++ code, since the C++ code is also available to me. I have the same code in C++, Matlab and Python.

While doing that, I was surprised once again to find that Matlab is faster than C++ in matrix assembly and computation. I have a slightly larger code, from which I am investigating a segment of matrix-vector multiplication. The larger code performs such multiplications at multiple instances. Overall the code in C++ is much faster than Matlab (because function calls in Matlab carry overhead, etc.), but Matlab seems to outperform C++ in the matrix-vector multiplication itself (code snippet at the bottom).

Results

The table below compares the time it takes to assemble the kernel matrix with the time it takes to multiply the matrix by the vector. The results are compiled for a matrix size NxN, where N varies from 10,000 to 40,000, which is not that large. The interesting thing is that Matlab outperforms C++ more and more as N grows: Matlab is 3.8-5.8 times faster in total time, and it is also faster in both matrix assembly and computation.

 ___________________________________________
|N=10,000   Assembly    Computation  Total  |
|MATLAB     0.3387      0.031        0.3697 |
|C++        1.15        0.24         1.4    |
|Times faster                        3.8    |
 ___________________________________________ 
|N=20,000   Assembly    Computation  Total  |
|MATLAB     1.089       0.0977       1.187  |
|C++        5.1         1.03         6.13   |
|Times faster                        5.2    |
 ___________________________________________
|N=40,000   Assembly    Computation  Total  |
|MATLAB     4.31        0.348        4.655  |
|C++        23.25       3.91         27.16  |
|Times faster                        5.8    |
 -------------------------------------------

Question

Is there a faster way of doing this in C++? Am I missing something? I understand that C++ uses for loops, but my understanding is that Matlab will also be doing something similar in meshgrid.

Code snippets

Matlab code:

%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data       = load('Input/input.txt');
location   = Data(:,1:2);           
charges    = Data(:,3:end);         
N          = length(location);      
m          = size(charges,2);       

%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1=ex1; 
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);

tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);

Class (which uses meshgrid):

classdef ex1
    methods 
        function [kernel] = kernel_2D(obj, x,y) 
            [i1,j1] = meshgrid(y(:,1),x(:,1));
            [i2,j2] = meshgrid(y(:,2),x(:,2));
            kernel = sqrt( (i1 - j1) .^ 2 + (i2 - j2) .^2 );
        end
    end       
end

C++ code:

EDIT

Compiled using a makefile with the following flags:

CC=g++ 
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include 
LDFLAGS= -g -fopenmp  
LIB_PATH= 

SOURCESTEXT= src/read_Location_Charges.cpp 
SOURCESF=examples/matvec.cpp
OBJECTSF= $(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel
mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)
$(EXECUTABLEF): $(OBJECTSF)
    $(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $@ $(LIB_PATH)
.cpp.o:
    $(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $@


#include "environment.hpp"
using namespace std;
using namespace Eigen;

class ex1 
{
public:
    void kernel_2D(const unsigned long M, double*& x, const unsigned long N,  double*&  y, MatrixXd& kernel)    {   
        kernel  =   MatrixXd::Zero(M,N);
        for(unsigned long i=0;i<M;++i)  {
            for(unsigned long j=0;j<N;++j)  {
                        double X = x[0*M+i] - y[0*N+j];  // x holds M points, stored column-major
                        double Y = x[1*M+i] - y[1*N+j];  // (here M == N, so behavior is unchanged)
                        kernel(i,j) = sqrt((X*X) + (Y*Y));
            }
        }
    }
};

int main()
{
    /* Input ----------------------------------------------------------------------------- */
    unsigned long N = 40000;          unsigned m=1;                   
    double* charges;                  double* location;
    charges =   new double[N * m]();  location =    new double[N * 2]();
    clock_t start;                    clock_t end;
    double exactAssemblyTime;         double exactComputationTime;

    read_Location_Charges ("input/test_input.txt", N, location, m, charges);

    MatrixXd charges_           =   Map<MatrixXd>(charges, N, m);
    MatrixXd Q;
    ex1 Kex1;

    /* Process ------------------------------------------------------------------------ */
    // Matrix assembly
    start = clock();
        Kex1.kernel_2D(N, location, N, location, Q);
    end = clock();
    exactAssemblyTime = double(end-start)/double(CLOCKS_PER_SEC);

    //Computation
    start = clock();
        MatrixXd QH = Q * charges_;
    end = clock();
    exactComputationTime = double(end-start)/double(CLOCKS_PER_SEC);

    cout << endl << "Assembly     time: " << exactAssemblyTime << endl;
    cout << endl << "Computation time: " << exactComputationTime << endl;


    // Clean up
    delete []charges;
    delete []location;
    return 0;
}

Answer

As said in the comments, MatLab relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use the latest Eigen (e.g. 3.4), and proper compilation flags to enable AVX/FMA (if available) and multithreading:

-O3 -DNDEBUG -march=native

Since charges_ is a vector, better use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
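With m = 1, that change is just the type of the mapped object. A sketch against the variables in the question's main() (assuming charges holds exactly N values):

```cpp
// Map the charges as a vector: Q * charges_ then dispatches to a
// matrix-vector (GEMV) kernel instead of an N x 1 matrix-matrix one.
VectorXd charges_ = Map<VectorXd>(charges, N);
VectorXd QH      = Q * charges_;
```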

If you have Intel's MKL, then you can also let Eigen use it to get exactly the same performance as MatLab for this precise operation.
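If MKL is installed, the opt-in on Eigen's side is a single macro before the first Eigen include; the MKL link line varies by version and platform, so the one shown here is only illustrative:

```cpp
// Must appear before any Eigen header is included.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
// Then link against MKL, e.g. -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp
// (use Intel's MKL Link Line Advisor for the exact line for your setup).
```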

Regarding the assembly, it is better to swap the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) so that the outermost loop runs in parallel, and finally you can simplify your code using Eigen:

void kernel_2D(const unsigned long M, double* x, const unsigned long N,  double*  y, MatrixXd& kernel)    {
    kernel.resize(M,N);
    auto x0 = ArrayXd::Map(x,M);
    auto x1 = ArrayXd::Map(x+M,M);
    auto y0 = ArrayXd::Map(y,N);
    auto y1 = ArrayXd::Map(y+N,N);
    #pragma omp parallel for
    for(unsigned long j=0;j<N;++j)
      kernel.col(j) = sqrt((x0-y0(j)).abs2() + (x1-y1(j)).abs2());
}

With multithreading you need to measure wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz), for N=20000 the assembly time drops to 0.36s and the matrix-vector product takes 0.24s, so 0.6s in total, which is faster than MatLab even though my CPU seems to be slower than yours.

