Why are my f2py programs slower than Python programs?


Problem description


I recently wrote a time-consuming program in Python and decided to rewrite the most time-consuming part in Fortran.

However, the Fortran code, wrapped with f2py, is slower than the Python code. Can anyone tell me how to find out what is happening here?

For reference, here is the Python function:

import numpy as np
from numpy import linalg as LA

def iterative_method(alpha0, beta0, epsilon0, epsilons0, omega, smearing=0.01, precision=0.01, max_step=20, flag=0):
    # alpha0, beta0, epsilon0, epsilons0 are square numpy arrays
    m, n = np.shape(epsilon0)
    # np.complex was removed in NumPy 1.24; the builtin complex is equivalent
    Omega = np.eye(m, dtype=complex) * (omega + smearing * 1j)
    green = LA.inv(Omega - epsilon0)
    alpha = np.dot(alpha0, np.dot(green, alpha0))
    beta = np.dot(beta0, np.dot(green, beta0))
    epsilon = epsilon0 + np.dot(alpha0, np.dot(green, beta0)) + np.dot(beta0, np.dot(green, alpha0))
    epsilons = epsilons0 + np.dot(alpha0, np.dot(green, beta0))

    # The body returns immediately, so this `while` behaves like an `if`:
    # each recursive call performs one iteration step.
    while np.max(np.abs(alpha0)) > precision and np.max(np.abs(beta0)) > precision and flag < max_step:
        flag += 1
        # The original call also passed an undefined `min_step`; it is dropped here.
        return iterative_method(alpha, beta, epsilon, epsilons, omega, smearing, precision, max_step, flag)
    return epsilon, epsilons, flag

The corresponding Fortran code is:

SUBROUTINE iterate(eout, esout, alpha, beta, e, es, omega, smearing, prec, max_step, rank)
    INTEGER, PARAMETER :: dp = kind(1.0d0)
    REAL(kind=dp) :: omega, smearing, prec
    INTEGER :: max_step, step, rank, cnt
    COMPLEX(kind=dp) :: alpha(rank,rank), beta(rank,rank), omega_mat(rank, rank),&
     green(rank, rank), e(rank,rank), es(rank,rank)
    COMPLEX(kind=dp), INTENT(out) :: eout(rank, rank), esout(rank, rank)
    step = 0
    omega_mat = 0
    DO cnt=1, rank
        omega_mat(cnt, cnt) = 1.0_dp
    ENDDO
    omega_mat = omega_mat * (omega + (0.0_dp, 1.0_dp) * smearing)
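    ! Note: .and. binds tighter than .or., so this reads A .or. (B .and. C),
    ! unlike the Python condition, which is A and B and C.
    DO WHILE (maxval(abs(alpha)) .gt. prec .or.  maxval(abs(beta)) .gt. prec .and. step .lt. max_step)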
        green = zInverse(rank, omega_mat - e) ! zInverse is calling lapack to compute inverse of the matrix
        e = e + matmul(alpha, matmul(green, beta)) + matmul(beta, matmul(green, alpha))
        es = es + matmul(alpha, matmul(green, beta))
        alpha = matmul(alpha, matmul(green, alpha))
        beta = matmul(beta, matmul(green, beta))
        step = step + 1
    ENDDO
    print *, step
    eout = e
    esout = es
END SUBROUTINE iterate

In a test, the Python code took about 5 seconds while the Fortran code took about 7 seconds, which is hard to accept. Also, I can hardly see any overhead in the Fortran code. Is the wrapper to blame?

Edit: I was not using BLAS for matmul. After switching to BLAS, the Fortran and Python versions both take around 5 seconds.

Solution

First, profile the Python code so you know exactly how it spends its time. Then, if you like, you can do a similar thing on the Fortran code using a debugger.
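To see how the Python code spends its time, one option is the standard-library `cProfile` and `pstats` modules. The following is a minimal sketch, not the technique the answer originally linked to: the `profile_call` helper and the `workload` stand-in are hypothetical names that do not appear in the question. In practice you would pass `iterative_method` and its arguments instead of `workload`.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text).

    Hypothetical helper: the report lists the `top` entries sorted by
    cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def workload(n):
    """Stand-in for the real function: sum of squares below n."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(workload, 100_000)
    print(report)
```

If nearly all of the cumulative time lands in `numpy.linalg.inv` and `numpy.dot`, the Python layer itself is not the bottleneck.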

I suspect essentially all of the time goes into matrix operations, so any speed difference is due to the math library, not to the language that calls it. This post relays some of my experience doing that. Routines for matrix multiplication, inversion, or Cholesky factorization are often designed to be efficient on large matrices, but not on small ones.
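A rough way to check whether fixed per-call cost matters at your matrix sizes is to time one product at several sizes and divide by the number of multiply-adds. The sketch below assumes NumPy is installed; `seconds_per_dot` is a hypothetical helper, and the sizes are arbitrary choices rather than values from the question.

```python
import timeit

import numpy as np


def seconds_per_dot(n, repeats=5, calls=200):
    """Best-of-`repeats` time for one n-by-n complex np.dot call."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    b = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    timer = timeit.Timer(lambda: np.dot(a, b))
    return min(timer.repeat(repeat=repeats, number=calls)) / calls


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        t = seconds_per_dot(n, calls=50)
        # An n-by-n product needs n**3 multiply-adds. If the cost per
        # multiply-add is much higher at small n, fixed per-call overhead
        # is likely a large share of the time there.
        print(f"n={n:4d}  {t:.3e} s/call  {t / n**3:.3e} s per multiply-add")
```

The numbers depend on the machine and on the BLAS library NumPy is linked against, so treat them as a comparison across sizes rather than as absolute figures.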

For example, the matrix-multiplication routine DGEMM (a BLAS routine distributed with LAPACK) has two character arguments, TRANSA and TRANSB, which can be upper or lower case and specify whether each input matrix is transposed. To examine the value of those arguments, it calls a function LSAME. I found that if I spend a large fraction of my time multiplying small matrices, such as 4x4, the program actually spends nearly all of its time calling LSAME and very little time multiplying matrices. You can see how it would be easy to fix that.
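One way to picture such a fix is to bypass the general routine when the matrices are small. The sketch below is pure Python and only illustrates the structure: `general_matmul`, `small_matmul`, `multiply`, and the `SMALL_LIMIT` cutoff are hypothetical names, and the flag check merely stands in for the LSAME calls described above. In compiled code the same idea would be a hand-written kernel for small sizes.

```python
SMALL_LIMIT = 8  # hypothetical cutoff; choose it by measurement


def general_matmul(transa, transb, a, b):
    """General routine: validates the transpose flags on every call."""
    flags = []
    for flag in (transa, transb):
        # Case-insensitive check, standing in for LSAME.
        if flag.upper() not in ("N", "T"):
            raise ValueError(f"bad transpose flag: {flag!r}")
        flags.append(flag.upper() == "T")
    if flags[0]:
        a = [list(col) for col in zip(*a)]
    if flags[1]:
        b = [list(col) for col in zip(*b)]
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def small_matmul(a, b):
    """Specialized path: no flags, no validation, square inputs only."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def multiply(a, b):
    """Dispatch small square products to the check-free kernel."""
    if len(a) <= SMALL_LIMIT and len(a) == len(a[0]) == len(b) == len(b[0]):
        return small_matmul(a, b)
    return general_matmul("N", "N", a, b)
```

In pure Python the arithmetic itself dominates, so this version will not reproduce the timing effect; it only shows where the per-call checks sit and how a dispatcher can skip them.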
