Need help vectorizing code or optimizing

Problem description

I am trying to do a double integral by first interpolating the data to make a surface. I am using numba to try and speed this process up, but it's just taking too long.

Here is my code, with the images needed to run the code located here and here.

Recommended answer

Noting that your code has a quadruple-nested set of for loops, I focused on optimizing the inner pair. Here's the old code:

for i in range(K.shape[0]):
    for j in range(K.shape[1]):

        print(i, j)
        # create an r vector
        r = (i*distX, j*distY, z)

        for x in range(img.shape[0]):
            for y in range(img.shape[1]):
                # create a ksi vector, then calculate its norm
                # and the dot product of r and ksi
                ksi = (x*distX, y*distY, z)
                ksiNorm = np.linalg.norm(ksi)
                ksiDotR = float(np.dot(ksi, r))

                # calculate the integrand
                temp[x,y] = img[x,y]*np.exp(1j*k*ksiDotR/ksiNorm)

        # interpolate so that we can do the integral, then take the integral
        temp2 = rbs(a, b, temp.real)
        K[i,j] = temp2.integral(0, n, 0, m)

Since K and img are each about 2000x2000, the innermost statements need to be executed sixteen trillion times. This is simply not practical using Python, but we can shift the work into C and/or Fortran using NumPy to vectorize. I did this one careful step at a time to try to make sure the results will match; here's what I ended up with:

import numpy as np
from scipy.interpolate import RectBivariateSpline as rbs

# create all r vectors: R[i,j] = (i*distX, j*distY, z)
R = np.empty((K.shape[0], K.shape[1], 3))
R[:,:,0] = np.repeat(np.arange(K.shape[0]), K.shape[1]).reshape(K.shape) * distX
R[:,:,1] = np.arange(K.shape[1]) * distY
R[:,:,2] = z

# create all ksi vectors: KSI[x,y] = (x*distX, y*distY, z)
KSI = np.empty((img.shape[0], img.shape[1], 3))
KSI[:,:,0] = np.repeat(np.arange(img.shape[0]), img.shape[1]).reshape(img.shape) * distX
KSI[:,:,1] = np.arange(img.shape[1]) * distY
KSI[:,:,2] = z

# vectorized 2-norm; see http://stackoverflow.com/a/7741976/4323
KSInorm = np.sum(np.abs(KSI)**2, axis=-1)**(1./2)

# loop over entire K, which is same shape as img, rows first
# this loop populates K, one pixel at a time (so can be parallelized)
for i in range(K.shape[0]):
    for j in range(K.shape[1]):

        print(i, j)

        # dot product of every ksi vector with this r, shape == img.shape
        KSIdotR = np.dot(KSI, R[i,j])
        temp = img * np.exp(1j * k * KSIdotR / KSInorm)

        # interpolate so that we can do the integral, then take the integral
        temp2 = rbs(a, b, temp.real)
        K[i,j] = temp2.integral(0, n, 0, m)

The inner pair of loops is now completely gone, replaced by vectorized operations done in advance (at a space cost linear in the size of the inputs).
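One way to check that a vectorization like this is lossless is to run both versions on a small array and compare the results. The sketch below does that for the integrand only. The function names, the array sizes, and the values chosen for the grid spacing, `z`, and `k` are made up for illustration and are not from the question's code.

```python
import numpy as np

def integrand_loops(img, r, dist_x, dist_y, z, k):
    """Reference version: the original inner pair of loops."""
    out = np.empty(img.shape, dtype=complex)
    for x in range(img.shape[0]):
        for y in range(img.shape[1]):
            ksi = np.array([x * dist_x, y * dist_y, z])
            phase = k * np.dot(ksi, r) / np.linalg.norm(ksi)
            out[x, y] = img[x, y] * np.exp(1j * phase)
    return out

def integrand_vectorized(img, r, dist_x, dist_y, z, k):
    """Same integrand, computed for all (x, y) at once."""
    ksi = np.empty(img.shape + (3,))
    # broadcasting fills each row/column with its physical coordinate
    ksi[:, :, 0] = (np.arange(img.shape[0]) * dist_x)[:, np.newaxis]
    ksi[:, :, 1] = (np.arange(img.shape[1]) * dist_y)[np.newaxis, :]
    ksi[:, :, 2] = z
    ksi_norm = np.sqrt(np.sum(ksi ** 2, axis=-1))
    # np.dot of an (M, N, 3) array with a length-3 vector gives shape (M, N)
    return img * np.exp(1j * k * np.dot(ksi, r) / ksi_norm)

# Small made-up test case (assumed values, not the question's data)
rng = np.random.default_rng(0)
small_img = rng.random((6, 5))
r_vec = np.array([2 * 0.5, 3 * 0.25, 1.0])  # r for i=2, j=3

slow = integrand_loops(small_img, r_vec, 0.5, 0.25, 1.0, 2.0)
fast = integrand_vectorized(small_img, r_vec, 0.5, 0.25, 1.0, 2.0)
print(np.allclose(slow, fast))
```

Note that `z` must be nonzero here, otherwise the ksi vector at `x = y = 0` has zero norm and the division fails in both versions.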

This reduces the time per iteration of the outer two loops from 340 seconds to 1.3 seconds on my MacBook Air 1.6 GHz i5, without using Numba. Of the 1.3 seconds per iteration, 0.68 seconds are spent in the rbs function, which is scipy.interpolate.RectBivariateSpline. There is probably room to optimize further. Here are some ideas:

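  1. Re-enable Numba. I don't have it on my system. It may not make much difference at this point, but it is easy for you to test.
  2. Do more domain-specific optimization, such as trying to simplify the fundamental calculations being done. My optimizations are intended to be lossless, and I don't know your problem domain, so I can't optimize as deeply as you may be able to.
  3. Try to vectorize the remaining loops. This may be tough unless you are willing to replace the scipy RBS function with something that supports multiple calculations per call.
  4. Get a faster CPU. Mine is pretty slow; you can probably get a speedup of at least 2x simply by using a better computer than my tiny laptop.
  5. Downsample your data. Your test images are 2000x2000 pixels but contain fairly little detail. If you cut their linear dimensions by 2-10x, you'd get a huge speedup.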
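The downsampling idea in the list above could be sketched as block averaging with NumPy. This is only one possible approach, and `downsample_mean` is a hypothetical helper, not something from the question's code. Since K has the same shape as img, shrinking each linear dimension by a factor f cuts the number of integrand evaluations by roughly f**4. The grid spacings (distX, distY) would also need to be multiplied by f so that the physical coordinates stay the same.

```python
import numpy as np

def downsample_mean(img, factor):
    """Shrink a 2-D array by averaging non-overlapping factor x factor blocks.

    Rows and columns that do not fill a whole block are dropped.
    """
    rows = (img.shape[0] // factor) * factor
    cols = (img.shape[1] // factor) * factor
    trimmed = img[:rows, :cols]
    # split each axis into (number of blocks, block size), then average per block
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

demo = np.arange(16.0).reshape(4, 4)
print(downsample_mean(demo, 2))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```

Whether averaging is acceptable depends on how much detail the integral really needs, so it is worth comparing results at two resolutions before committing to a factor.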

So that's it for me for now. Where does this leave you? Assuming a slightly better computer and no further optimization work, even the optimized code would take about a month to process your test images. If you only have to do this once, maybe that's fine. If you need to do it more often, or need to iterate on the code as you try different things, you probably need to keep optimizing--starting with that RBS function which consumes more than half the time now.
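The "about a month" figure can be reproduced with some quick arithmetic. The sketch below assumes a 2000x2000 output, the measured 1.3 seconds per outer iteration, and a 2x speedup from a better machine; the 2x value is an assumption taken from the rough guess in the list of ideas, not a measurement. `estimate_days` is a hypothetical helper name.

```python
def estimate_days(side, seconds_per_iteration, speedup=1.0):
    """Estimated wall-clock days to fill a side x side output array."""
    iterations = side * side
    total_seconds = iterations * seconds_per_iteration / speedup
    return total_seconds / 86400.0  # seconds per day

# Measured 1.3 s per iteration, assumed 2x faster computer
print(round(estimate_days(2000, 1.3, speedup=2.0), 1))  # 30.1

# Same machine as the measurement, no speedup
print(round(estimate_days(2000, 1.3), 1))  # 60.2
```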

Bonus tip: your code would be a lot easier to deal with if it didn't have nearly-identical variable names like k and K, nor used j as a variable name and also as a complex number suffix (0j).
