Why is C++ much faster than python with boost?


Problem description

My goal is to write a small library for spectral finite elements in Python and to that purpose I tried extending python with a C++ library using Boost, with the hope that it would make my code faster.

class Quad {
    public:
        Quad(int, int);
        double integrate(boost::function<double(std::vector<double> const&)> const&);
        double integrate_wrapper(boost::python::object const&);
        std::vector< std::vector<double> > nodes;
        std::vector<double> weights;
};

...

namespace std {
    typedef std::vector< std::vector< std::vector<double> > > cube;
    typedef std::vector< std::vector<double> > mat;
    typedef std::vector<double> vec;
}

...

double Quad::integrate(boost::function<double(vec const&)> const& func) {

    double result = 0.;
    for (unsigned int i = 0; i < nodes.size(); ++i) {
        result += func(nodes[i]) * weights[i];
    }
    return result;
}

// ---- PYTHON WRAPPER ----
double Quad::integrate_wrapper(boost::python::object const& func) {
    std::function<double(vec const&)> lambda;
    switch (this->nodes[0].size()) {
        case 1: lambda = [&func](vec const& v) -> double { return boost::python::extract<double>(func (v[0])); }; break;
        case 2: lambda = [&func](vec const& v) -> double { return boost::python::extract<double>(func(v[0], v[1])); }; break;
        case 3: lambda = [&func](vec const& v) -> double { return boost::python::extract<double>(func(v[0], v[1], v[2])); }; break;
        default: cout << "Dimension must be 1, 2, or 3" << endl; exit(0);
    }
    return integrate(lambda);
}

// ---- EXPOSE TO PYTHON ----
BOOST_PYTHON_MODULE(hermite)
{
    using namespace boost::python;

    class_<std::vec>("double_vector")
        .def(vector_indexing_suite<std::vec>())
        ;

    class_<std::mat>("double_mat")
        .def(vector_indexing_suite<std::mat>())
        ;

    class_<Quad>("Quad", init<int,int>())
        .def("integrate", &Quad::integrate_wrapper)
        .def_readonly("nodes", &Quad::nodes)
        .def_readonly("weights", &Quad::weights)
        ;
}

I compared the performance of three different methods to calculate the integral of two functions. The two functions are:

  • The function f1(x,y,z) = x*x
  • A function that is more difficult to evaluate: f2(x,y,z) = np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z)

The methods used are:

  1. Call the library from a C++ program:

double func(vector<double> v) {
    return F1_OR_F2;
}

int main() {
    hermite::Quad quadrature(100, 3);
    double result = quadrature.integrate(func);
    cout << "Result = " << result << endl;
}

  2. Call the library from a Python script:

    import hermite
    def function(x, y, z): return F1_OR_F2
    my_quad = hermite.Quad(100, 3)
    result = my_quad.integrate(function)
    

  3. Use a for loop in Python:

    import hermite
    def function(x, y, z): return F1_OR_F2
    my_quad = hermite.Quad(100, 3)
    weights = my_quad.weights
    nodes = my_quad.nodes
    result = 0.
    for i in range(len(weights)):
        result += weights[i] * function(nodes[i][0], nodes[i][1], nodes[i][2])
    

    Here are the execution times of each method. The time was measured using the time command for method 1 and the Python module time for methods 2 and 3, and the C++ code was compiled using CMake with set (CMAKE_BUILD_TYPE Release).

    For f1:

    • Method 1: 0.07s user 0.01s system 99% cpu 0.083 total
    • Method 2: 0.19 s
    • Method 3: 3.06 s

    For f2:

    • Method 1: 0.28s user 0.01s system 99% cpu 0.289 total
    • Method 2: 12.47 s
    • Method 3: 16.31 s

    Based on these results, my questions are the following:

    • Why is the first method so much faster than the second?

    • Could the Python wrapper be improved to reach comparable performance between methods 1 and 2?

    • Why is method 2 more sensitive than method 3 to the difficulty of the function to integrate?
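
To make the first question concrete, it may help to count how often the Python callable is invoked. The sketch below is a pure-Python stand-in for the loop in `Quad::integrate`, not the Boost code itself. `CountingFunction` and `make_tensor_nodes` are hypothetical helpers, the points and weights are placeholders rather than a real quadrature rule, and the idea that the nodes form a tensor grid is an assumption, since the `Quad` constructor is not shown.

```python
import itertools


class CountingFunction:
    """Wraps a callable and counts how many times it is invoked."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.func(*args)


def make_tensor_nodes(points_1d, dim):
    """Hypothetical helper: every dim-tuple built from a 1D point set."""
    return list(itertools.product(points_1d, repeat=dim))


def integrate(func, nodes, weights):
    """Same structure as Quad::integrate: one callback call per node."""
    result = 0.0
    for node, weight in zip(nodes, weights):
        result += func(*node) * weight
    return result


points_1d = [-1.0, 0.0, 1.0]            # placeholder points, not a real rule
nodes = make_tensor_nodes(points_1d, 3)  # 3**3 = 27 nodes
weights = [1.0] * len(nodes)             # placeholder weights

f1 = CountingFunction(lambda x, y, z: x * x)
total = integrate(f1, nodes, weights)
print(f1.calls, total)
```

With n points per dimension and d dimensions, such a grid has n**d nodes, so the callable is entered n**d times whether the surrounding loop runs in C++ or in Python.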

    EDIT: I also tried to define a function that accepts a string as argument, writes it to a file, and proceeds to compile the file and dynamically load the resulting .so file:

    double Quad::integrate_from_string(string const& function_body) {
    
        // Write function to file
        ofstream helper_file;
        helper_file.open("/tmp/helper_function.cpp");
        helper_file << "#include <vector>\n#include <cmath>\n";
        helper_file << "extern \"C\" double toIntegrate(std::vector<double> v) {\n";
        helper_file << "    return " << function_body << ";\n}";
        helper_file.close();
    
        // Compile file
        system("c++ /tmp/helper_function.cpp -o /tmp/helper_function.so -shared -fPIC");
    
        // Load function dynamically
        typedef double (*vec_func)(vec);
        void *function_so = dlopen("/tmp/helper_function.so", RTLD_NOW);
        vec_func func = (vec_func) dlsym(function_so, "toIntegrate");
        double result = integrate(func);
        dlclose(function_so);
        return result;
    }
    

    It's quite dirty and probably not very portable, so I'd be happy to find a better solution, but it works well and plays nicely with the ccode function of sympy.

    SECOND EDIT: I have rewritten the function in pure Python using NumPy.

    import numpy as np
    import numpy.polynomial.hermite_e as herm
    import time
    def integrate(function, degrees):
        dim = len(degrees)
        nodes_multidim = []
        weights_multidim = []
        for i in range(dim):
            nodes_1d, weights_1d = herm.hermegauss(degrees[i])
            nodes_multidim.append(nodes_1d)
            weights_multidim.append(weights_1d)
        grid_nodes = np.meshgrid(*nodes_multidim)
        grid_weights = np.meshgrid(*weights_multidim)
        nodes_flattened = []
        weights_flattened = []
        for i in range(dim):
            nodes_flattened.append(grid_nodes[i].flatten())
            weights_flattened.append(grid_weights[i].flatten())
        nodes = np.vstack(nodes_flattened)
        weights = np.prod(np.vstack(weights_flattened), axis=0)
        return np.dot(function(nodes), weights)
    
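    def function(v): return F1_OR_F2
    start = time.time()  # wall-clock time before the integration
    result = integrate(function, [100,100,100])
    end = time.time()    # wall-clock time after the integration
    print("-> Result = " + str(result) + ", Time = " + str(end-start))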
    

    Somewhat surprisingly (at least to me), there is no significant difference in performance between this method and the pure C++ implementation. In particular, it takes 0.059s for f1 and 0.36s for f2.
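
One way to check that the vectorized version computes the same quadrature as a point-by-point loop is to run both on a small grid. This is only a sketch: `tensor_rule` is a hypothetical helper that repeats the node and weight construction from `integrate` above, and the degree of 4 is chosen just to keep the loop short.

```python
import numpy as np
import numpy.polynomial.hermite_e as herm


def tensor_rule(degrees):
    """Hypothetical helper: flattened tensor-product nodes and weights."""
    rules = [herm.hermegauss(d) for d in degrees]
    grid_nodes = np.meshgrid(*[r[0] for r in rules])
    grid_weights = np.meshgrid(*[r[1] for r in rules])
    nodes = np.vstack([g.flatten() for g in grid_nodes])  # shape (dim, N)
    weights = np.prod(np.vstack([g.flatten() for g in grid_weights]), axis=0)
    return nodes, weights


def f1(v):
    # v[0] holds the x coordinate(s): a scalar for one point,
    # an array of length N when all points are passed at once.
    return v[0] * v[0]


nodes, weights = tensor_rule([4, 4, 4])

# One NumPy expression evaluated over all nodes.
vectorized = np.dot(f1(nodes), weights)

# One Python-level call per node.
looped = 0.0
for i in range(weights.size):
    looped += f1(nodes[:, i]) * weights[i]

print(vectorized, looped)
```

Both paths perform the same arithmetic. The difference is that the vectorized path enters the Python function once, while the loop enters it once per node.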

    Recommended answer

    Another approach

    In a slightly less general way, your problem can be solved much more easily. You could write the integration and the function in pure Python code and compile it using Numba.

    First approach (0.025 s per integration after the first run, on an i7-4771)

    The function is compiled at the first call; this takes about 0.5 s.

    function_2:

    @nb.njit(fastmath=True)
    def function_to_integrate(x,y,z):
      return np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z)
    

    Integration

    @nb.jit(fastmath=True)
    def integrate3(num_int_Points):
      nodes_1d, weights_1d = herm.hermegauss(num_int_Points)
    
      result=0.
    
      for i in range(num_int_Points):
        for j in range(num_int_Points):
          result+=np.sum(function_to_integrate(nodes_1d[i],nodes_1d[j],nodes_1d[:])*weights_1d[i]*weights_1d[j]*weights_1d[:])
    
      return result
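
Written out, the double loop above evaluates a tensor-product Gauss-Hermite rule. Assuming the nodes $x_i$ and weights $w_i$ are those returned by `hermegauss(n)`, which NumPy documents for the weight function $e^{-x^2/2}$, the sum is:

```latex
\int_{\mathbb{R}^3} f(x,y,z)\, e^{-(x^2+y^2+z^2)/2}\, dx\, dy\, dz
\;\approx\;
\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} w_i\, w_j\, w_k\, f(x_i, x_j, x_k)
```

The innermost sum over $k$ corresponds to the `np.sum(...)` call, which handles `nodes_1d[:]` and `weights_1d[:]` in a single array operation.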
    

    Testing

    import numpy as np
    import numpy.polynomial.hermite_e as herm
    import numba as nb
    import time
    
    num_int_Points=100  # number of quadrature points per dimension
    t1=time.time()
    nodes_1d, weights_1d = herm.hermegauss(num_int_Points)
    
    for i in range(100):
      #result = integrate3(nodes_1d,weights_1d,100)
      result = integrate3(100) 
    
    print(time.time()-t1)
    print(result)
    

    Second approach

    The function can also run in parallel. When integrating over many elements, the Gauss points and weights need to be calculated only once. This results in a runtime of about 0.005 s.
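
A minimal sketch of the "calculated only once" idea, shown without Numba so that it runs with NumPy alone: the 1D rule is computed a single time and reused for several integrands. The helper `integrate_with_rule` and the list `integrands` are illustrative names, not part of the answer, and 10 points per dimension is an arbitrary choice.

```python
import numpy as np
import numpy.polynomial.hermite_e as herm


def integrate_with_rule(func, nodes_1d, weights_1d):
    """3D tensor-product quadrature from a precomputed 1D rule."""
    x, y, z = np.meshgrid(nodes_1d, nodes_1d, nodes_1d, indexing="ij")
    # w[i, j, k] = weights_1d[i] * weights_1d[j] * weights_1d[k]
    w = np.einsum("i,j,k->ijk", weights_1d, weights_1d, weights_1d)
    return np.sum(func(x, y, z) * w)


# Computed once, outside the loop over integrands.
nodes_1d, weights_1d = herm.hermegauss(10)

integrands = [
    lambda x, y, z: x * x,
    lambda x, y, z: x * x + y * y + z * z,
]
results = [integrate_with_rule(f, nodes_1d, weights_1d) for f in integrands]
print(results)
```

Passing the rule in as arguments, as the Numba version below does, keeps the cost of `hermegauss` out of the timed loop.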

    @nb.njit(fastmath=True,parallel=True)
    def integrate3(nodes_1d,weights_1d,num_int_Points):
    
      result=0.
    
      for i in nb.prange(num_int_Points):
        for j in range(num_int_Points):
          result+=np.sum(function_to_integrate(nodes_1d[i],nodes_1d[j],nodes_1d[:])*weights_1d[i]*weights_1d[j]*weights_1d[:])
    
      return result
    

    Passing an arbitrary function

    import numpy as np
    import numpy.polynomial.hermite_e as herm
    import numba as nb
    import time
    
    def f(x,y,z):
      return np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z) +np.cos(2*x+2*y+2*z) + x*y + np.exp(-z*z)
    
    def make_integrate3(f):
      f_jit=nb.njit(f,fastmath=True)
      @nb.njit(fastmath=True,parallel=True)
      def integrate_3(nodes_1d,weights_1d,num_int_Points):
          result=0.
          for i in nb.prange(num_int_Points):
            for j in range(num_int_Points):
              result+=np.sum(f_jit(nodes_1d[i],nodes_1d[j],nodes_1d[:])*weights_1d[i]*weights_1d[j]*weights_1d[:])
    
          return result
    
      return integrate_3
    
    
    int_fun=make_integrate3(f)
    num_int_Points=100
    nodes_1d, weights_1d = herm.hermegauss(num_int_Points)
    #Calling it the first time (takes about 1s)
    result = int_fun(nodes_1d,weights_1d,100)
    
    t1=time.time()
    for i in range(100):
      result = int_fun(nodes_1d,weights_1d,100)
    
    print(time.time()-t1)
    print(result)
    

    After the first call this takes about 0.002 s using Numba 0.38 with Intel SVML.
