Poor performance of C++ function in Cython

Question

I have this C++ function, which I can call from Python with the code below. Its performance is only half that of the pure C++ version. Is there a way to bring the two to the same level? I compile both with the -Ofast -march=native flags. I do not understand where the 50% is lost, because most of the time should be spent in the C++ kernel. Is Cython making a memory copy that I can avoid?

namespace diff
{
    void diff_cpp(double* __restrict__ at, const double* __restrict__ a, const double visc,
                  const double dxidxi, const double dyidyi, const double dzidzi,
                  const int itot, const int jtot, const int ktot)
    {
        const int ii = 1;
        const int jj = itot;
        const int kk = itot*jtot;

        for (int k=1; k<ktot-1; k++)
            for (int j=1; j<jtot-1; j++)
                for (int i=1; i<itot-1; i++)
                {
                    const int ijk = i + j*jj + k*kk;
                    at[ijk] += visc * (
                            + ( (a[ijk+ii] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-ii]) ) * dxidxi 
                            + ( (a[ijk+jj] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                            + ( (a[ijk+kk] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                            );
                }
    }
}

I have this .pyx file:

# import both numpy and the Cython declarations for numpy
import cython
import numpy as np
cimport numpy as np

# declare the interface to the C code
cdef extern from "diff_cpp.cpp" namespace "diff":
    void diff_cpp(double* at, double* a, double visc, double dxidxi, double dyidyi, double dzidzi, int itot, int jtot, int ktot)

@cython.boundscheck(False)
@cython.wraparound(False)
def diff(np.ndarray[double, ndim=3, mode="c"] at not None,
         np.ndarray[double, ndim=3, mode="c"] a not None,
         double visc, double dxidxi, double dyidyi, double dzidzi):
    cdef int ktot, jtot, itot
    ktot, jtot, itot = at.shape[0], at.shape[1], at.shape[2]
    diff_cpp(&at[0,0,0], &a[0,0,0], visc, dxidxi, dyidyi, dzidzi, itot, jtot, ktot)
    return None

I call this function from Python:

import numpy as np
import diff
import time

nloop = 20
itot = 256
jtot = 256
ktot = 256
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot))

index = np.arange(ncells)
a = (index/(index+1))**2
a.shape = (ktot, jtot, itot)

# Check results
diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
print("at={0}".format(at.flatten()[itot*jtot+itot+itot//2]))

# Time the loop
start = time.perf_counter()
for i in range(nloop):
    diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
end = time.perf_counter()

print("Time/iter: {0} s ({1} iters)".format((end-start)/nloop, nloop))

This is the setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("diff",
                             sources=["diff.pyx"],
                             language="c++",
                             # each flag is passed as its own list element
                             extra_compile_args=["-Ofast", "-march=native"],
                             include_dirs=[numpy.get_include()])],
)

And here is the C++ reference file, which reaches twice the performance:

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <stdlib.h>
#include <cstdio>
#include <ctime>
#include "math.h"

void init(double* const __restrict__ a, double* const __restrict__ at, const int ncells)
{
    for (int i=0; i<ncells; ++i)
    {
        a[i]  = pow(i,2)/pow(i+1,2);
        at[i] = 0.;
    }
}

void diff(double* const __restrict__ at, const double* const __restrict__ a, const double visc, 
          const double dxidxi, const double dyidyi, const double dzidzi, 
          const int itot, const int jtot, const int ktot)
{
    const int ii = 1;
    const int jj = itot;
    const int kk = itot*jtot;

    for (int k=1; k<ktot-1; k++)
        for (int j=1; j<jtot-1; j++)
            for (int i=1; i<itot-1; i++)
            {
                const int ijk = i + j*jj + k*kk;
                at[ijk] += visc * (
                        + ( (a[ijk+ii] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-ii]) ) * dxidxi 
                        + ( (a[ijk+jj] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                        + ( (a[ijk+kk] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                        );
            }
}

int main()
{
    const int nloop = 20;
    const int itot = 256;
    const int jtot = 256;
    const int ktot = 256;
    const int ncells = itot*jtot*ktot;

    double *a  = new double[ncells];
    double *at = new double[ncells];

    init(a, at, ncells);

    // Check results
    diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot); 
    printf("at=%.20f\n",at[itot*jtot+itot+itot/2]);

    // Time performance 
    std::clock_t start = std::clock(); 

    for (int i=0; i<nloop; ++i)
        diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot); 

    double duration = (std::clock() - start ) / (double)CLOCKS_PER_SEC;

    printf("time/iter = %f s (%i iters)\n",duration/(double)nloop, nloop);

    return 0;
}

Recommended answer

The problem here is not what happens at run time, but which optimizations are applied at compile time.

Which optimizations are performed depends on the compiler (and even its version), and there is no guarantee that every optimization that could be done will be done.

There are actually two different reasons why the Cython build is slower, depending on whether you use g++ or clang++:

  • g++ cannot optimize as well because of the -fwrapv flag in the Cython build
  • clang++ cannot optimize the function in the first place (read on to see what happens).

First issue (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations cannot be done.

If you look at the log of the setup, you will see:

 x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native

As you said, -Ofast wins over -O2 because it comes last. The problem is -fwrapv, which seems to prevent some optimizations: with it, signed integer overflow is defined to wrap around, so the compiler can no longer treat it as undefined behavior (UB) and exploit that for optimization.
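To illustrate why wrap-around semantics can matter for a loop like the one in diff_cpp, here is a minimal sketch. It is not taken from the question's code: the function name sum_strided is hypothetical, and the comments describe one plausible mechanism rather than what any particular compiler version is guaranteed to do.

```cpp
#include <cassert>

// Hypothetical helper, not part of the question's code: sums every
// `stride`-th element of `a`, using an int index like diff_cpp does.
double sum_strided(const double* a, int n, int stride)
{
    assert(n >= 0 && stride > 0);

    double s = 0.0;
    for (int i = 0; i < n; ++i)
    {
        // Without -fwrapv, overflow of i*stride is UB, so the compiler may
        // assume it never happens and simply advance a pointer by `stride`
        // elements per iteration. With -fwrapv, i*stride has to wrap at the
        // width of int before it is widened to an address offset, which can
        // get in the way of that transformation.
        s += a[i*stride];
    }
    return s;
}
```

Compiling such a function with and without -fwrapv and comparing the generated assembly is one way to check whether a given compiler is affected.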

So you have the following options:

  • add -fno-wrapv to extra_compile_args; the disadvantage is that all files are now compiled with the changed flags, which might be unwanted.
  • build a library from the cpp file with only the flags you like and link it to your Cython module. This solution has some overhead, but has the advantage of being robust: as you pointed out, different Cython flags could be the problem for different compilers, so the first solution might be too brittle.
  • I am not sure whether you can disable the default flags, but maybe there is some information in the docs.

Second issue (clang++).

When I compile your cpp program with my pretty old g++ 5.4:

 g++ test.cpp -o test -Ofast -march=native -fwrapv

it becomes almost 3 times slower than when compiled without -fwrapv. This is, however, a weakness of the optimizer: when inlining, it should see that no signed integer overflow is possible (all dimensions are about 256), so the -fwrapv flag should not have any impact.

My old clang++ version (3.8) seems to do a better job here: with the flags above I cannot see any degradation in performance. I need to disable inlining via -fno-inline to get slower code, but then it is slower even without -fwrapv, i.e.:

 clang++ test.cpp -o test -Ofast -march=native -fno-inline

So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can optimize the code for the known values, something Cython cannot do.

So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but was able to optimize it for size=256. Cython, however, can only use the non-optimized version of diff. That is why -fno-wrapv has no positive impact.

My take-away: disallow inlining of the function of interest in the cpp tester (e.g. compile it into its own object file) to ensure a level playing field with Cython; otherwise one sees the performance of a program that was specially optimized for this one input.
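One possible way to set up such a tester inside a single file is sketched below. It assumes a GCC- or Clang-compatible compiler for the noinline attribute. The names NOINLINE, diff_noinline and run_once are hypothetical. The grid size is read through a volatile variable so that the compiler cannot treat it as a compile-time constant, which is an extra precaution not mentioned above. Compiling the kernel into its own object file, as suggested, remains the more robust route.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// noinline is a GCC/Clang extension; on other compilers the macro is empty.
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Same stencil as in the question, but kept out of line so the caller's
// constants are not baked into the loop by inlining.
NOINLINE void diff_noinline(double* at, const double* a, double visc,
                            double dxidxi, double dyidyi, double dzidzi,
                            int itot, int jtot, int ktot)
{
    const int ii = 1;
    const int jj = itot;
    const int kk = itot*jtot;

    for (int k=1; k<ktot-1; k++)
        for (int j=1; j<jtot-1; j++)
            for (int i=1; i<itot-1; i++)
            {
                const int ijk = i + j*jj + k*kk;
                at[ijk] += visc * (
                        + ( (a[ijk+ii] - a[ijk]) - (a[ijk] - a[ijk-ii]) ) * dxidxi
                        + ( (a[ijk+jj] - a[ijk]) - (a[ijk] - a[ijk-jj]) ) * dyidyi
                        + ( (a[ijk+kk] - a[ijk]) - (a[ijk] - a[ijk-kk]) ) * dzidzi
                        );
            }
}

// Hypothetical driver: applies the kernel once to an n*n*n grid filled with
// a = i*i and returns the updated value at the grid's centre point.
double run_once(int n)
{
    assert(n >= 3);

    // Reading the size through a volatile hides its value from the optimizer.
    volatile int hidden = n;
    const int itot = hidden, jtot = hidden, ktot = hidden;
    const std::size_t ncells = static_cast<std::size_t>(itot)*jtot*ktot;

    std::vector<double> a(ncells), at(ncells, 0.0);
    for (int k=0; k<ktot; ++k)
        for (int j=0; j<jtot; ++j)
            for (int i=0; i<itot; ++i)
                a[i + j*itot + k*itot*jtot] = static_cast<double>(i)*i;

    diff_noinline(at.data(), a.data(), 0.5, 1.0, 1.0, 1.0, itot, jtot, ktot);

    const int c = n/2;
    return at[c + c*itot + c*itot*jtot];
}
```

With a = i*i the second difference in the i direction is 2 and the other two directions contribute nothing, so the test values are easy to verify by hand before timing the kernel.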

NB: A funny thing is that if all ints are replaced by unsigned ints, then naturally -fwrapv does not play any role, but the version with unsigned int is as slow as the int version with -fwrapv. This is only logical, as there is no undefined behavior to be exploited.
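A small sketch of what "no undefined behavior" means for unsigned arithmetic follows. The function names next_index and check_wraparound are hypothetical and only serve as illustration.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: advances an unsigned index by one.
unsigned int next_index(unsigned int i)
{
    // Well-defined for every input: unsigned arithmetic is modular, so the
    // largest value plus one wraps to 0. The compiler therefore has to
    // preserve this wrap-around, just as it does for int under -fwrapv.
    return i + 1u;
}

// Checks the wrap-around at the top of the unsigned range.
void check_wraparound()
{
    assert(next_index(std::numeric_limits<unsigned int>::max()) == 0u);
}
```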
