Comparing Python, Numpy, Numba and C++ for matrix multiplication


Problem description

In a program I am working on, I need to multiply two matrices repeatedly. Because of the size of one of the matrices, this operation takes some time and I wanted to see which method would be the most efficient. The matrices have dimensions (m x n)*(n x p) where m = n = 3 and 10^5 < p < 10^6.

With the exception of Numpy, which I assume works with an optimized algorithm, every test consists of a simple implementation of the matrix multiplication C[i,j] = Σₖ A[i,k]·B[k,j].

Below are my various implementations:

Python

import numpy as np

def dot_py(A,B):
    m, n = A.shape
    p = B.shape[1]

    C = np.zeros((m,p))

    for i in range(0,m):
        for j in range(0,p):
            for k in range(0,n):
                C[i,j] += A[i,k]*B[k,j]
    return C

Numpy

def dot_np(A,B):
    C = np.dot(A,B)
    return C

Numba

The code is the same as the Python one, but it is compiled just in time before being used:

import numba as nb

dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython = True)(dot_py)

So far, each method call has been timed using the timeit module 10 times. The best result is kept. The matrices are created using np.random.rand(n,m).

C++

mat2 dot(const mat2& m1, const mat2& m2)
{
    int m = m1.rows_;
    int n = m1.cols_;
    int p = m2.cols_;

    mat2 m3(m,p);

    for (int row = 0; row < m; row++) {
        for (int col = 0; col < p; col++) {
            for (int k = 0; k < n; k++) {
                m3.data_[p*row + col] += m1.data_[n*row + k]*m2.data_[p*k + col];
            }
        }
    }

    return m3;
}

Here, mat2 is a custom class that I defined and dot(const mat2& m1, const mat2& m2) is a friend function to this class. It is timed using QPF and QPC from Windows.h and the program is compiled using MinGW with the g++ command. Again, the best time obtained from 10 executions is kept.
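
For reference, a minimal sketch of what such a QPF/QPC timing loop might look like; this is not the author's actual benchmark code, and the "mat2.h" include name is an assumption:

#include <windows.h>   // QueryPerformanceFrequency / QueryPerformanceCounter
#include "mat2.h"      // assumed header name for the mat2 class shown below

// Returns the best wall-clock time (in seconds) over `repeats` calls to dot(A, B).
double best_dot_time(const mat2& A, const mat2& B, int repeats = 10)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    double best = 1e300;
    for (int r = 0; r < repeats; r++) {
        QueryPerformanceCounter(&t0);
        mat2 C = dot(A, B);
        QueryPerformanceCounter(&t1);
        double elapsed = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
        if (elapsed < best) best = elapsed;
    }
    return best;
}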

Results

As expected, the simple Python code is slower but it still beats Numpy for very small matrices. Numba turns out to be about 30% faster than Numpy for the largest cases.

I am surprised by the C++ results, where the multiplication takes almost an order of magnitude more time than with Numba. In fact, I expected the two to take a similar amount of time.

This leads to my main question: is this normal and, if not, why is C++ slower than Numba? I just started learning C++, so I might be doing something wrong. If so, what is my mistake, or what could I do to improve the efficiency of my code (other than choosing a better algorithm)?

EDIT 1

Here is the header of the mat2 class.

#ifndef MAT2_H
#define MAT2_H

#include <iostream>

class mat2
{
private:
    int rows_, cols_;
    float* data_;

public: 
    mat2() {}                                   // (default) constructor
    mat2(int rows, int cols, float value = 0);  // constructor
    mat2(const mat2& other);                    // copy constructor
    ~mat2();                                    // destructor

    // Operators
    mat2& operator=(mat2 other);                // assignment operator

    float operator()(int row, int col) const;
    float& operator() (int row, int col);

    mat2 operator*(const mat2& other);

    // Operations
    friend mat2 dot(const mat2& m1, const mat2& m2);

    // Other
    friend void swap(mat2& first, mat2& second);
    friend std::ostream& operator<<(std::ostream& os, const mat2& M);
};

#endif
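
The member definitions are not included in the post. For completeness, here is a minimal sketch of definitions consistent with this header (using copy-and-swap for the by-value assignment operator); this is only an assumption, not the author's actual implementation:

#include "mat2.h"      // assumed header name
#include <algorithm>   // std::copy, std::fill, std::swap

mat2::mat2(int rows, int cols, float value)
    : rows_(rows), cols_(cols), data_(new float[rows * cols])
{
    std::fill(data_, data_ + rows * cols, value);
}

mat2::mat2(const mat2& other)
    : rows_(other.rows_), cols_(other.cols_), data_(new float[other.rows_ * other.cols_])
{
    std::copy(other.data_, other.data_ + rows_ * cols_, data_);
}

mat2::~mat2() { delete[] data_; }

void swap(mat2& first, mat2& second)
{
    std::swap(first.rows_, second.rows_);
    std::swap(first.cols_, second.cols_);
    std::swap(first.data_, second.data_);
}

mat2& mat2::operator=(mat2 other)   // by-value parameter + swap
{
    swap(*this, other);
    return *this;
}

float mat2::operator()(int row, int col) const { return data_[cols_ * row + col]; }
float& mat2::operator()(int row, int col)      { return data_[cols_ * row + col]; }

// (operator*, operator<< and the default constructor body are omitted for brevity.)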

EDIT 2

As many suggested, using the optimization flag was the missing element to match Numba. Below are the new curves compared to the previous ones. The curve tagged v2 was obtained by switching the two inner loops and shows another 30% to 50% improvement.
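
The v2 code itself is not shown in the post; a sketch of what swapping the two inner loops looks like (the i-k-j order, which lets the innermost loop run contiguously over the rows of m2 and m3 and vectorize more easily) could be:

// Sketch of the "v2" loop order described above; like dot(), it would need to be
// declared a friend of mat2 to access data_ directly.
mat2 dot_v2(const mat2& m1, const mat2& m2)
{
    int m = m1.rows_;
    int n = m1.cols_;
    int p = m2.cols_;

    mat2 m3(m,p);

    for (int row = 0; row < m; row++) {
        for (int k = 0; k < n; k++) {
            float a = m1.data_[n*row + k];   // loaded once, reused across the inner loop
            for (int col = 0; col < p; col++) {
                m3.data_[p*row + col] += a * m2.data_[p*k + col];
            }
        }
    }

    return m3;
}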

Solution

Definitely use -O3 for optimization. This turns vectorization on, which should significantly speed your code up.

Numba is supposed to do that already.
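
For example, with the MinGW setup described in the question, this just means adding the flag to the g++ invocation, e.g. g++ -O3 -o matmul main.cpp mat2.cpp (the file names here are placeholders).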

