Parallelize (not symmetric) loops in Python

Question

The following code is written in Python and it works, i.e. it returns the expected result. However, it is very slow and I believe it can be optimized.

import math
import numpy
import numpy.matlib

G_tensor = numpy.matlib.identity(N_particles*3,dtype=complex)

for i in range(N_particles):
    for j in range(i, N_particles):
        if i != j:

            #Do lots of things, here is shown an example.
            # However you should not be scared because 
            #it only fills the G_tensor
            R = numpy.linalg.norm(numpy.array(positions[i])-numpy.array(positions[j]))
            rx = numpy.array(positions[i][0])-numpy.array(positions[j][0])
            ry = numpy.array(positions[i][1])-numpy.array(positions[j][1])
            rz = numpy.array(positions[i][2])-numpy.array(positions[j][2])
            krq = (k*R)**2
            pf = -k**2*alpha*numpy.exp(1j*k*R)/(4*math.pi*R)
            a = 1.+(1j*k*R-1.)/(krq)
            b = (3.-3.*1j*k*R-krq)/(krq) 
            G_tensor[3*i+0,3*j+0] = pf*(a + b * (rx*rx)/(R**2))  #Gxx
            G_tensor[3*i+1,3*j+1] = pf*(a + b * (ry*ry)/(R**2))  #Gyy
            G_tensor[3*i+2,3*j+2] = pf*(a + b * (rz*rz)/(R**2))  #Gzz
            G_tensor[3*i+0,3*j+1] = pf*(b * (rx*ry)/(R**2))      #Gxy
            G_tensor[3*i+0,3*j+2] = pf*(b * (rx*rz)/(R**2))      #Gxz
            G_tensor[3*i+1,3*j+0] = pf*(b * (ry*rx)/(R**2))      #Gyx
            G_tensor[3*i+1,3*j+2] = pf*(b * (ry*rz)/(R**2))      #Gyz
            G_tensor[3*i+2,3*j+0] = pf*(b * (rz*rx)/(R**2))      #Gzx
            G_tensor[3*i+2,3*j+1] = pf*(b * (rz*ry)/(R**2))      #Gzy

            G_tensor[3*j+0,3*i+0] = pf*(a + b * (rx*rx)/(R**2))  #Gxx
            G_tensor[3*j+1,3*i+1] = pf*(a + b * (ry*ry)/(R**2))  #Gyy
            G_tensor[3*j+2,3*i+2] = pf*(a + b * (rz*rz)/(R**2))  #Gzz
            G_tensor[3*j+0,3*i+1] = pf*(b * (rx*ry)/(R**2))      #Gxy
            G_tensor[3*j+0,3*i+2] = pf*(b * (rx*rz)/(R**2))      #Gxz
            G_tensor[3*j+1,3*i+0] = pf*(b * (ry*rx)/(R**2))      #Gyx
            G_tensor[3*j+1,3*i+2] = pf*(b * (ry*rz)/(R**2))      #Gyz
            G_tensor[3*j+2,3*i+0] = pf*(b * (rz*rx)/(R**2))      #Gzx
            G_tensor[3*j+2,3*i+1] = pf*(b * (rz*ry)/(R**2))      #Gzy

Do you know how I can parallelize it? Note that the two loops are not symmetric.

Edit one: A numpythonic solution was presented above, and I compared the C++ implementation, my Python loop version and the numpythonic version. The results are the following:
- C++ = 0.14 s
- numpythonic version = 1.39 s
- Python loop version = 46.56 s
The results could probably get better with the Intel version of numpy.

Recommended answer

Python is not a fast language. Number crunching in Python should always hand the time-critical parts to code written in a compiled language. With compilation down to the CPU level you can speed up the code by a factor of up to 100 and then still go for parallelization. So I would not look at using more cores to do inefficient work, but at working more efficiently. I see the following ways to speed up the code:

1) Better use of numpy: Can you do your calculations directly at the vector/matrix level instead of at the scalar level? E.g. rx = positions[:, None, 0] - positions[None, :, 0] (assuming positions is an (N, 3) numpy array; not checked against your data), or something along those lines.

If that is not possible with your kind of calculations, then you can go for option 2 or 3.

2) Use Cython. Cython compiles Python code to C, which is then compiled to machine code for your CPU. By using static typing in the right places you can make your code much faster; see the Cython tutorials, e.g.: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html

3) If you are familiar with FORTRAN, it might be a good idea to write just this part in FORTRAN and then call it from Python using f2py. In fact, your code looks a lot like FORTRAN anyway. For C and C++, SWIG is one great tool for making compiled code available in Python, but there are plenty of other techniques (Cython, Boost::Python, ctypes, numba etc.).
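As an illustration of option 1, here is a minimal sketch of how the double loop from the question could be replaced by numpy broadcasting. The function name build_g_tensor and the assumption that positions converts to an (N, 3) float array are mine, not part of the question; k and alpha are taken to be scalars as in the original loop.

```python
import numpy as np

def build_g_tensor(positions, k, alpha):
    """Vectorized sketch of the 3N x 3N matrix filled by the loop in the question."""
    pos = np.asarray(positions, dtype=float)      # assumed shape (N, 3)
    n = pos.shape[0]

    # r[i, j, :] = positions[i] - positions[j], computed for all pairs at once
    r = pos[:, None, :] - pos[None, :, :]         # shape (N, N, 3)
    R = np.linalg.norm(r, axis=-1)                # shape (N, N)

    # Avoid division by zero for i == j; those blocks are overwritten below.
    np.fill_diagonal(R, 1.0)

    krq = (k * R) ** 2
    pf = -k**2 * alpha * np.exp(1j * k * R) / (4 * np.pi * R)
    a = 1.0 + (1j * k * R - 1.0) / krq
    b = (3.0 - 3.0j * k * R - krq) / krq

    # outer[i, j, p, q] = r_p * r_q / R^2 for the pair (i, j)
    outer = r[:, :, :, None] * r[:, :, None, :] / (R**2)[:, :, None, None]

    # blocks[i, j] is the 3x3 block pf * (a * I + b * outer)
    blocks = pf[:, :, None, None] * (
        a[:, :, None, None] * np.eye(3) + b[:, :, None, None] * outer
    )

    # Diagonal blocks stay the identity, as in the original initialization.
    idx = np.arange(n)
    blocks[idx, idx] = np.eye(3)

    # (N, N, 3, 3) -> (3N, 3N), so that entry [3*i+p, 3*j+q] = blocks[i, j, p, q]
    return blocks.transpose(0, 2, 1, 3).reshape(3 * n, 3 * n)
```

This computes both triangles at once instead of exploiting the symmetry, and the intermediate (N, N, 3, 3) complex array needs about 144 * N**2 bytes, so for very large N the work may have to be done in slabs of rows.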

When you have done this and it is still too slow, using GPU power with pyCUDA, or parallelization with mpi4py or multiprocessing, might be an option.
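If you do end up parallelizing, the non-symmetric loop matters for load balancing: row i of the outer loop has N-1-i inner iterations, so a plain contiguous split gives the first worker far more work than the last. A minimal sketch of one possible scheme (the helper names folded_row_chunks and chunk_cost are hypothetical) pairs row m with row N-1-m, so that every pair costs the same N-1 iterations, and hands the pairs out round-robin:

```python
def folded_row_chunks(n, n_workers):
    """Split rows 0..n-1 of a triangular loop into chunks of similar cost."""
    chunks = [[] for _ in range(n_workers)]
    for m in range((n + 1) // 2):
        # Row m has n-1-m inner iterations and row n-1-m has m,
        # so together they always cost n-1 (the middle row stands alone).
        rows = {m, n - 1 - m}
        chunks[m % n_workers].extend(sorted(rows))
    return chunks

def chunk_cost(rows, n):
    """Number of (i, j) pairs with j > i covered by the given rows."""
    return sum(n - 1 - i for i in rows)
```

For n = 12 and 3 workers each chunk carries 22 of the 66 pairs. Each chunk could then be given to one worker, e.g. through multiprocessing.Pool.map; since separate processes do not share G_tensor, each worker would have to return its blocks for the parent process to write into the matrix.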
