Performance comparison: Fortran, Numpy, Cython and Numexpr


Problem description

我有以下功能:

$ p $ def get_denom(n_comp,qs,x,cp,cs):
'''
len(n_comp)= 1#蛋白质数量
len(cp)= n_comp#蛋白质浓度
len(qp)= n_comp#蛋白质容量
len(x )= 3 * n_comp + 1#拟合参数
len(cs)= 1

'''
k = x [0:n_comp]
sigma = x [ n_comp:2 * n_comp]
z = x [2 * n_comp:3 * n_comp]

a =(sigma + z)*(k *(qs / cs)**(z- 1 ))* cp
denom = np.sum(a)+ cs
return denom



 子程序get_denom(qs,x,cp) ,cs,n_comp,denom)

!计算SMA模型中的分母(Brooks and Cramer 1992)
!该函数在特定的盐浓度和等温点下调用
!我遍历了组件的数量

隐式无

!输入变量的声明
整数,意图(in):: n_comp!组件数量
双精度,意图(in):: cs,qs!盐浓度,游离配体浓度
双精度,维数(n_comp),INTENT(IN):: cp!蛋白质浓度
双精度,维(3 * n_comp + 1),INTENT(IN):: x!参数

!局部变量的声明
双精度,维(n_comp):: k,sigma,z
双精度:: a
整数:: i

! outpur变量的声明
double precision,intent(out):: denom

k = x(1:n_comp)! equlibrium常数
sigma = x(n_comp + 1:2 * n_comp)!空间位阻因子
z = x(2 * n_comp + 1:3 * n_comp)! (i)+(sigma(i)+ z(i))*(k(i)*((b))的蛋白质的收费

a = 0.
do i = 1,n_comp
a = a + )*(z(i)-1))* cp(i)
end do

denom = a + cs

结束子程序get_denom

通过使用以下代码编译.f95文件:

)1) f2py -c -m get_denom get_denom.f95 --fcompiler = gfortran



2) f2py -c -m get_denom_vec get_denom.f95 --fcompiler = gfortran --f90flags =' - msse2'(最后一个选项应该打开自动向量化)



我测试函数:

  import numpy as np 
import get_denom as fort_denom
import get_denom_vec as fort_denom_vec
from matplotlib import pyplot as plt
%matplotlib inline
$ b $ def get_denom(n_comp,qs,x,cp,cs):
k = x [0:n_comp]
sigma = x [n_comp:2 * n_comp]
z = x [2 * n_comp:3 * n_comp]
#计算Equal中的分母14a-14c(Brooks& Cramer 1992)
a =(si (a)+ cs
return denom

n_comp = 100
cp = np.tile(1.243,n_comp)
cs = 100.
qs = np.tile(1100.,n_comp)
x = np.random。 rand(3 * n_comp + 1)
denom = np.empty(1)
%timeit get_denom(n_comp,qs,x,cp,cs)
%timeit fort_denom.get_denom(qs, x,cp,cs,n_comp)
%timeit fort_denom_vec.get_denom(qs,x,cp,cs,n_comp)

我添加了下面的Cython代码:

  import cython 
#import the numpy and the Cython numpy的声明
import numpy as np
cimport numpy as np

@ cython.boundscheck(False)
@ cython.wraparound(False)
def get_denom(int n_comp,np.ndarray [double,ndim = 1,mode ='c'] qs,np.ndarray [double,ndim = 1,mode ='c'] x,np.ndarray [double,ndim = 1 ,mode ='c'] cp,double cs):

cdef int i
cdef double a
cdef double denom
cdef double [:] k = x [0:n_comp]
cdef dou ble [:] sigma = x [n_comp:2 * n_comp]
cdef double [:] z = x [2 * n_comp:3 * n_comp]
#计算Equ 14a - 14c中的分母(Brooks &安培; Cramer 1992)
a = 0.
(范围内的n_comp):
#a + =(sigma [i] + z [i])*(pow(k [i] * (qs [i] / cs),(z [i] -1)))* cp [i]
a + =(sigma [i] + z [i])*(k [i] * [i] / cs)**(z [i] -1))* cp [i]

denom = a + cs

return denom

编辑:

新增Numexpr,使用一个线程: p>

  def get_denom_numexp(n_comp,qs,x,cp,cs):
k = x [0:n_comp]
sigma = x [n_comp:2 * n_comp]
z = x [2 * n_comp:3 * n_comp]
#计算Equ 14a - 14c中的分母(Brooks& Cramer 1992)
a = ne.evaluate('(sigma + z)*(k *(qs / cs)**(z-1))* cp')
return cs + np.sum(a)

ne.set_num_threads(1)#仅使用1个线程
%timeit get_denom_numexp(n_comp,qs,x,cp,cs)

结果是(越小越好):



为什么随着阵列大小的增加,Fortran的速度越来越接近Numpy了吗?我怎么能加快Cython?使用指针?

解决方案

确定它。



我们被允许在我们的一个盒子上安装Numpy等,这可以对您的原始帖子进行全面的解释。



简短的回答是,您的原件问题在某种意义上是错误的问题。此外,其中一位参与者有过多次无理取闹和错误信息,这些错误和捏造值得关注,以免任何人误以为他们相信他们,并损害他们的成本。

另外,我决定提交这个答案作为一个单独的答案,而不是编辑我的回答4月14日,由于下面的原因,并适当。

A部分:OP的答案首先,在原始文章中处理该问题:您可能还记得,我只能对Fortran方面发表评论,因为我们的政策对于可以安装哪些软件以及我们的机器上的哪些地方是严格的,并且我们没有Python等(直到现在)。我也曾多次表示,你的结果的特点是有趣的,我们可以称之为弯曲的角色,或者也许是凹陷。

另外,纯粹与相对的结果(因为你没有发布绝对时机,当时我也没有拿到Numpy),我曾多次指出可能潜藏着一些重要的信息。

正是这种情况。



首先,我想确保能够重现您的结果,因为我们不使用Python / F2py通常,在结果中隐含着什么编译器设置等是不明显的,所以我进行了各种测试以确保我们正在谈论苹果对苹果(我的Apr 14答案表明,Debug vs Release / O2有很大的不同)。

图1显示了我的Python结果仅仅包含以下三种情况:您的原始Python / Numpy内部子程序(称为P,我只是剪切/粘贴您的原始内容) ,你原来的基于Do的Fortran s / r(称为这个FDo,我只是复制/粘贴你的原始文件,和以前一样),以及我之前提到的依赖于Array Sections的变体之一,因此需要Sum()(调用此FSAS,通过编辑您的原始FDo创建)。图1显示了通过timeit的绝对时序。



图2显示相对结果基于您的Python / Numpy(P)时序划分的相对策略。只显示了两个(相对)Fortran变体。



显然,那些重现原始情节中的角色,我们可能会确信我的测试似乎与您的测试一致。



现在,您的原始问题是为什么是随着阵列大小的增加,Fortran的速度越来越接近Numpy了?。

其实并非如此。这纯粹是一种纯粹依靠相对时机的人为或扭曲,可能会给人留下深刻的印象。从图1可以看出,在绝对时机的情况下, Numpy和Fortran是分歧的。所以,事实上,如果你喜欢的话,Fortran的结果正在从Numpy转移,反之亦然。

更好的问题,以及我之前评论中反复提到的一个问题,这就是为什么这些曲线首先向上弯曲(例如线性曲线)?我以前的仅限于Fortran的结果显示了大部分平坦的相对性能比(我的Apr 14图表/答案中的黄线),所以我推测在Python方面发生了一些有趣的事情,并建议检查它。



显示这一点的一种方式是用另一种相对度量。我将每个(绝对)序列用它自己的最高值(即在n_comp = 10k)进行分割,看看这个内部相对性能如何展开(这些被称为10k值,代表n_comp = 10,000的时序) 。图3将P,FDo和FSAS的这些结果显示为P / P10k,FDo / FDo10k和FSAS / FSAS10k。为了清楚起见,y轴具有对数尺度。很显然,Fortran变体预制相对非常好,降低了n_comp c.f. P结果(例如红圈注释部分)。



换句话说, Fortran非常非常(非线性)更有效地减小数组大小。不确定为什么Python会因为减少n_comp而做得更糟糕......但它确实存在,并且可能是内部开销/设置等问题,以及解释器与编译器之间的差异等。

b
$ b

所以,并不是说Fortran正在赶上Python,恰恰相反,它正在继续与Python保持距离(参见图1)。但是,随着n_comp的减少,Python和Fortran之间的非线性差异会产生相对的时序,并且显然与直觉和非线性特性相反。因此,随着n_comp增加并且每种方法稳定到或多或少的线性模式,曲线变平坦以表明它们的差异正在线性增长,并且相对比率建立一个近似的常量(忽略内存争用,smp问题等)......如果允许n_comp> 10k,这很容易看出来,但是在我的4月14日答案中的黄线已经显示了这个仅限于Fortran的s / r's。

另外:我的首选是创建我自己的定时例程/函数。时间似乎很方便,但在黑匣子内部还有很多事情要做。设置您自己的循环和结构,以及确定您的计时功能的属性/分辨率对于进行适当的评估非常重要。

I have the following function:

def get_denom(n_comp,qs,x,cp,cs):
'''
len(n_comp) = 1 # number of proteins
len(cp) = n_comp # protein concentration
len(qp) = n_comp # protein capacity
len(x) = 3*n_comp + 1 # fit parameters
len(cs) = 1

'''
    k = x[0:n_comp]
    sigma = x[n_comp:2*n_comp]
    z = x[2*n_comp:3*n_comp]

    a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
    denom = np.sum(a) + cs
    return denom

I compare it against a Fortran implementation (my first Fortran function ever):

subroutine get_denom (qs,x,cp,cs,n_comp,denom)

! Calculates the denominator in the SMA model (Brooks and Cramer 1992)
! The function is called at a specific salt concentration and isotherm point
! It loops over the number of components

implicit none

! declaration of input variables
integer, intent(in) :: n_comp ! number of components
double precision, intent(in) :: cs,qs ! salt concentration, free ligand concentration
double precision, dimension(n_comp), INTENT(IN) ::cp ! protein concentration
double precision, dimension(3*n_comp + 1), INTENT(IN) :: x ! parameters

! declaration of local variables
double precision, dimension(n_comp) :: k,sigma,z
double precision :: a
integer :: i

! declaration of output variables
double precision, intent(out) :: denom

k = x(1:n_comp) ! equilibrium constant
sigma = x(n_comp+1:2*n_comp) ! steric hindrance factor
z = x(2*n_comp+1:3*n_comp) ! charge of protein

a = 0.
do i = 1,n_comp
    a = a + (sigma(i) + z(i))*(k(i)*(qs/cs)**(z(i)-1.))*cp(i)
end do

denom = a + cs

end subroutine get_denom

I compiled the .f95 file by using:

1) f2py -c -m get_denom get_denom.f95 --fcompiler=gfortran

2) f2py -c -m get_denom_vec get_denom.f95 --fcompiler=gfortran --f90flags='-msse2' (The last option should turn on auto-vectorization)

I test the functions by:

import numpy as np
import get_denom as fort_denom
import get_denom_vec as fort_denom_vec
from matplotlib import pyplot as plt
%matplotlib inline

def get_denom(n_comp,qs,x,cp,cs):
    k = x[0:n_comp]
    sigma = x[n_comp:2*n_comp]
    z = x[2*n_comp:3*n_comp]
    # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
    a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
    denom = np.sum(a) + cs
    return denom

n_comp = 100
cp = np.tile(1.243,n_comp)
cs = 100.
qs = np.tile(1100.,n_comp)
x= np.random.rand(3*n_comp+1)
denom = np.empty(1)
%timeit get_denom(n_comp,qs,x,cp,cs)
%timeit fort_denom.get_denom(qs,x,cp,cs,n_comp)
%timeit fort_denom_vec.get_denom(qs,x,cp,cs,n_comp)
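
For reference, a sweep over array sizes like the one behind the timing plot further down could be collected along these lines (a sketch only; the sizes and repeat counts are my own choices, and it reuses the get_denom and fort_denom names set up above):

import timeit
import numpy as np
import get_denom as fort_denom   # the f2py module built above

def bench(n_comp, reps=3, number=100):
    # build inputs of the requested size, mirroring the test above
    cp = np.tile(1.243, n_comp)
    qs = np.tile(1100., n_comp)
    cs = 100.
    x = np.random.rand(3*n_comp + 1)
    t_py = min(timeit.repeat(lambda: get_denom(n_comp, qs, x, cp, cs),
                             repeat=reps, number=number)) / number
    t_fort = min(timeit.repeat(lambda: fort_denom.get_denom(qs, x, cp, cs, n_comp),
                               repeat=reps, number=number)) / number
    return t_py, t_fort

for n in (10, 100, 1000, 10000):
    print(n, bench(n))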

I added the following Cython code:

import cython
# import both numpy and the Cython declarations for numpy
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def get_denom(int n_comp,np.ndarray[double, ndim=1, mode='c'] qs, np.ndarray[double, ndim=1, mode='c'] x,np.ndarray[double, ndim=1, mode='c'] cp, double cs):

    cdef int i
    cdef double a
    cdef double denom   
    cdef double[:] k = x[0:n_comp]
    cdef double[:] sigma = x[n_comp:2*n_comp]
    cdef double[:] z = x[2*n_comp:3*n_comp]
    # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
    a = 0.
    for i in range(n_comp):
    #a += (sigma[i] + z[i])*( pow( k[i]*(qs[i]/cs), (z[i]-1) ) )*cp[i]
        a += (sigma[i] + z[i])*( k[i]*(qs[i]/cs)**(z[i]-1) )*cp[i]

    denom = a + cs

    return denom
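
The build step for the Cython module is not shown; a minimal build script is enough (a sketch, assuming the code above is saved as get_denom_cy.pyx — that filename is just a placeholder of mine):

# setup.py -- minimal build script for the Cython version above
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("get_denom_cy.pyx"),
    include_dirs=[np.get_include()],  # required because the module cimports numpy
)

Building in place with python setup.py build_ext --inplace then makes the module importable next to the benchmark script.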

EDIT:

Added Numexpr, using one thread:

import numexpr as ne

def get_denom_numexp(n_comp,qs,x,cp,cs):
    k = x[0:n_comp]
    sigma = x[n_comp:2*n_comp]
    z = x[2*n_comp:3*n_comp]
    # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
    a = ne.evaluate('(sigma + z)*( k*(qs/cs)**(z-1) )*cp' )
    return cs + np.sum(a)

ne.set_num_threads(1)  # using just 1 thread
%timeit get_denom_numexp(n_comp,qs,x,cp,cs)

The result is (smaller is better):

Why is the speed of Fortran getting closer to Numpy with increasing array size? And how could I speed up Cython? Using pointers?

Solution

Sussed It.

OK, finally, we were permitted to install Numpy etc on one of our boxes, and that has allowed what may be a comprehensive explanation of your original post.

The short answer is that your original question is, in a sense, "the wrong question". In addition, there has been much vexatious abuse and misinformation by one of the contributors, and those errors and fabrications deserve attention, lest anyone make the mistake of believing them, to their cost.

Also, I have decided to submit this as a separate answer, rather than editing my answer of Apr 14, for the reasons given below, and for propriety.

Part A: The Answer to the OP

First things first, dealing with the question in the original post: you may recall I could only comment with respect to the Fortran side, since our policies are strict about what software may be installed and where on our machines, and we did not have Python etc. to hand (until just now). I had also repeatedly stated that the character of your result was interesting in terms of what we can call its curved character or perhaps "concavity".

In addition, working purely with "relative" results (as you did not post the absolute timings, and I did not have Numpy to hand at the time), I had indicated a few times that some important information may be lurking therein.

That is precisely the case.

First, I wanted to be sure I could reproduce your results. Since we don't normally use Python/F2py, it was not obvious what compiler settings etc. are implied in your results, so I performed a variety of tests to be sure we were talking apples-to-apples (my Apr 14 answer demonstrated that Debug vs Release/O2 makes a big difference).

Figure 1 shows my Python results for just three cases: your original Python/Numpy internal sub-program (call this P; I just cut/pasted your original), your original Do-loop based Fortran s/r as posted (call this FDo; again copied/pasted from your original), and one of the variations I had suggested earlier relying on Array Sections, and thus requiring Sum() (call this FSAS, created by editing your original FDo). Figure 1 shows the absolute timings via timeit.

Figure 2 shows the relative results based on your relative strategy of dividing by the Python/Numpy (P) timings. Only the two (relative) Fortran variants are shown.

Clearly, these reproduce the character of your original plot, and we may be confident that my tests are consistent with yours.

Now, your original question was "Why is the speed of Fortran getting closer to Numpy with increasing size of the arrays?".

In fact, it is not. It is an artefact/distortion of relying purely on "relative" timings, which can give that impression.

Looking at Figure 1, with the absolute timings, it is clear that the Numpy and Fortran timings are diverging. So, in fact, the Fortran results are moving away from Numpy, or vice versa, if you like.

A better question, and one which arose repeatedly in my previous comments, is why these curves bend upward in the first place (as opposed to, say, being linear). My previous Fortran-only results showed a "mostly" flat relative performance ratio (yellow lines in my Apr 14 chart/answer), and so I had speculated that something interesting was happening on the Python side and suggested checking that.

One way to show this is with yet another kind of relative measure. I divided each (absolute) series by its own highest value (i.e. at n_comp = 10k), to see how this "internal relative" performance unfolds (these are referred to as the "10k" values, representing the timings for n_comp = 10,000). Figure 3 shows these results for P, FDo, and FSAS as P/P10k, FDo/FDo10k, and FSAS/FSAS10k. For clarity, the y-axis has a logarithmic scale. It is clear that the Fortran variants perform relatively much better with decreasing n_comp compared with the P results (e.g. the red-circle annotated section).
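
In code terms, that normalisation is just each timing series divided by its own entry at the largest n_comp; a sketch of the idea (the series names in the usage comments are placeholders, not the arrays actually used for Figure 3):

import numpy as np

def internal_relative(times):
    # divide a series of absolute timings by its own largest-n_comp (last) entry
    times = np.asarray(times, dtype=float)
    return times / times[-1]

# e.g. rel_P = internal_relative(times_P)      # times_P: hypothetical array of absolute P timings
#      rel_FDo = internal_relative(times_FDo)  # likewise for FDo and FSAS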

Put differently, Fortran becomes very much (non-linearly) more efficient as the array size decreases. I am not sure exactly why Python does so much worse with decreasing n_comp ... but there it is, and it may be an issue of internal overhead/set-up etc., and the differences between interpreters and compilers.

So, it's not that Fortran is "catching up" with Python; quite the opposite, it is continuing to distance itself from Python (see Figure 1). However, the differences in the non-linearities between Python and Fortran with decreasing n_comp produce "relative" timings with an apparently counter-intuitive and non-linear character.

Thus, as n_comp increases and each method "stabilises" to a more or less linear mode, the curves flatten to show that their differences are growing linearly, and the relative ratios settle to an approximate constant (ignoring memory contention, SMP issues, etc.) ... this is easier to see if n_comp is allowed to go beyond 10k, but the yellow line in my Apr 14 answer already shows this for the Fortran-only s/r's.

Aside: My preference is to create my own timing routines/functions. timeit seems convenient, but there is much going on inside that "black box". Setting up your own loops and structures, and being certain of the properties/resolution of your timing functions, is important for a proper assessment.
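
By way of illustration, a bare-bones hand-rolled timer along those lines could look like this (a sketch only, not the routine used for the figures above):

import time

def time_call(func, *args, repeats=5, inner=1000):
    # call func(*args) 'inner' times per repeat; return the best mean per-call time
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(inner):
            func(*args)
        best = min(best, (time.perf_counter() - t0) / inner)
    return best

# know the resolution of the clock you are relying on
print(time.get_clock_info('perf_counter').resolution)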
