Different ways to optimize Python code with GPU/PyOpenCL: calling an extern function inside a GPU kernel


Question

I have used the following command to profile my Python code:

python2.7 -m cProfile -o X2_non_flat_multiprocessing_dummy.prof X2_non_flat.py

Then, I can visualize the overall breakdown of time across the different expensive functions:
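The saved profile can also be inspected programmatically with the standard pstats module. A minimal sketch, profiling a dummy function rather than the question's script (the function and file name here are placeholders):

```python
import cProfile
import os
import pstats
import tempfile

def work():
    # stand-in for the real workload
    return sum(i * i for i in range(100_000))

# profile the call and dump the stats to a file, like the -o flag above
pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

path = os.path.join(tempfile.mkdtemp(), "demo.prof")
pr.dump_stats(path)

# load the dump and print the most expensive entries
stats = pstats.Stats(path)
stats.sort_stats("cumulative").print_stats(5)
```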

As you can see, a lot of time is spent in Pobs_C and in the interpolation routine, which correspond to the following code snippet:

import numpy as np
from scipy.interpolate import CubicSpline

# NB: z_pk, paramo, c, delta_zpm, aa, bb, P_shot_GC, P_shot_WL and the
# F_* integrand functions are globals defined elsewhere in the script.
def Pobs_C(z, zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T, R_T, DG_T_fid, DG_T, WGT_T, WT_T, WIAT_T, cl, P_dd_spec, RT500):
    # evaluate each per-redshift spectrum spline where its argument is in range
    cc = 0
    P_dd_ok = np.zeros(len(z_pk))
    while cc < len(z_pk):
        if 0.0005 < (cl+0.5)/RT500[cc] < 35:
            P_dd_ok[cc] = P_dd_spec[cc]((cl+0.5)/RT500[cc])
        cc = cc + 1

    P_dd_ok = CubicSpline(z_pk, P_dd_ok)
    if paramo == 8:
        P_dd_ok = P_dd_ok(z)*(DG_T(z)/DG_T_fid(z))**2
    else:
        P_dd_ok = P_dd_ok(z)

    if paramo not in (9, 10, 11):  # the original "!= 9 or != 10 or != 11" chain was always True
        C_gg = c/(100.*h_p)*0.5*delta_zpm*np.sum((F_dd_GG(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[1:]), R_T(z[1:]), WGT_T[aa][1:], WGT_T[bb][1:], DG_T(z[1:]), P_dd_ok[1:]) + F_dd_GG(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[:-1]), R_T(z[:-1]), WGT_T[aa][:-1], WGT_T[bb][:-1], DG_T(z[:-1]), P_dd_ok[:-1]))) + P_shot_GC(zi, zj)
    else:
        C_gg = 0.
    if paramo < 12:
        C_ee = c/(100.*h_p)*0.5*delta_zpm*(np.sum(F_dd_LL(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[1:]), R_T(z[1:]), WT_T[aa][1:], WT_T[bb][1:], DG_T(z[1:]), P_dd_ok[1:]) + F_dd_LL(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[:-1]), R_T(z[:-1]), WT_T[aa][:-1], WT_T[bb][:-1], DG_T(z[:-1]), P_dd_ok[:-1])) + np.sum(F_IA_d(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[1:]), R_T(z[1:]), DG_T(z[1:]), WT_T[aa][1:], WT_T[bb][1:], WIAT_T[aa][1:], WIAT_T[bb][1:], P_dd_ok[1:]) + F_IA_d(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[:-1]), R_T(z[:-1]), DG_T(z[:-1]), WT_T[aa][:-1], WT_T[bb][:-1], WIAT_T[aa][:-1], WIAT_T[bb][:-1], P_dd_ok[:-1])) + np.sum(F_IAIA(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[1:]), R_T(z[1:]), DG_T(z[1:]), WIAT_T[aa][1:], WIAT_T[bb][1:], P_dd_ok[1:]) + F_IAIA(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[:-1]), R_T(z[:-1]), DG_T(z[:-1]), WIAT_T[aa][:-1], WIAT_T[bb][:-1], P_dd_ok[:-1]))) + P_shot_WL(zi, zj)
    else:
        C_ee = 0.
    C_gl = c/(100.*h_p)*0.5*delta_zpm*np.sum((F_dd_GL(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[1:]), R_T(z[1:]), DG_T(z[1:]), WGT_T[aa][1:], WT_T[bb][1:], WIAT_T[bb][1:], P_dd_ok[1:]) + F_dd_GL(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[:-1]), R_T(z[:-1]), DG_T(z[:-1]), WGT_T[aa][:-1], WT_T[bb][:-1], WIAT_T[bb][:-1], P_dd_ok[:-1])))
    return C_gg, C_ee, C_gl
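As a side note, the index-by-index while loop at the top of Pobs_C can be tightened with a NumPy mask. A minimal sketch with stand-in data (z_pk, RT500, cl and the P_dd_spec splines below are placeholders for the question's globals):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# stand-ins for the question's globals
z_pk = np.linspace(0.0, 2.5, 50)
RT500 = np.linspace(100.0, 5000.0, len(z_pk))
cl = 10.0
x = np.linspace(1e-5, 40.0, 100)
P_dd_spec = [CubicSpline(x, np.log1p(x) * (1.0 + zc)) for zc in z_pk]

# compute all spline arguments at once and mask the valid range
ratio = (cl + 0.5) / RT500
mask = (ratio > 0.0005) & (ratio < 35.0)

# one distinct spline per index, so the evaluation itself stays
# per-element, but the branch test is hoisted out of the loop
P_dd_ok = np.zeros(len(z_pk))
P_dd_ok[mask] = [P_dd_spec[i](ratio[i]) for i in np.flatnonzero(mask)]
```

Since each index has its own spline object, the calls cannot be fully vectorized, but the mask makes the valid range explicit and avoids recomputing the ratio twice per iteration.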

1) MAIN QUESTION: Is there a way to implement a GPU/OpenCL layer in this routine, especially for CubicSpline or the whole Pobs_C function? What alternatives would allow me to reduce the time spent in Pobs_C and its inner CubicSpline call?

I have a few notions of OpenCL (not PyOpenCL), for example the map-reduce pattern, or solving the 2D heat equation with a classical kernel.

2) PREVIOUS FEEDBACK: I know that we cannot expect a speedup by naively calling an extern function inside a kernel, since the GPU would have to perform a lot of calls. Instead, I should inline the content of the different functions to allow optimization: do you agree and confirm this? So, can I declare inside the kernel code a call to an extern function (I mean a function not inside the kernel, i.e. in the classical part of the code, the so-called host code)?

3) OPTIONAL QUESTION: Maybe I can declare this extern function inside the kernel: is it possible by explicitly making this declaration inside it? Indeed, that could avoid copying the bodies of all the functions that could potentially be GPU-parallelized.

Any idea/remark/advice would be welcome. Regards.

PS: Sorry if this is a rather general topic, but it helps me see more clearly which options are available for adding GPU/OpenCL optimization to the code above.

Answer

  1. Is there a way to implement a GPU/OpenCL layer in this routine, especially for CubicSpline or the whole Pobs_C function?

In all probability, no. The majority of the time in the profile appears to be spent in 12 million polynomial evaluations, and each evaluation call takes only 6 microseconds on the CPU. It is unclear whether there is significant embarrassing parallelism to expose in that operation, and GPUs are only useful for performing embarrassingly parallel tasks.
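Before reaching for a GPU, it is usually worth checking how much of that per-call cost is Python overhead: scipy's CubicSpline accepts whole arrays, so one batched call typically beats millions of scalar calls by a wide margin. A small sketch with synthetic data (not the question's spectra):

```python
import time
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0.0, 10.0, 200)
spl = CubicSpline(x, np.sin(x))
pts = np.random.uniform(0.0, 10.0, 100_000)

t0 = time.perf_counter()
scalar = np.array([spl(p) for p in pts])  # one Python call per point
t1 = time.perf_counter()
batched = spl(pts)                        # one vectorized call
t2 = time.perf_counter()
# the batched path is typically orders of magnitude faster
print(f"scalar: {t1 - t0:.3f}s  batched: {t2 - t1:.3f}s")
```

The two results are identical; only the call overhead differs. Restructuring Pobs_C so that spline evaluations are batched this way is a CPU-side optimization that needs no GPU at all.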

  2. So, can I declare inside the kernel code a call to an extern function (I mean a function not inside the kernel, i.e. in the host code)?

No. That is impossible. And it is difficult to fathom what benefit that could possibly impart, given that the Python code has to run on the host CPU anyway.
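For contrast, what OpenCL does allow is defining helper functions in the same device-side program source as the kernel, which is the usual substitute for "extern" calls. A sketch unrelated to the question's code (the function names are illustrative):

```c
// Device-side helper: compiled together with the kernel by the OpenCL
// driver. It runs on the GPU and cannot call back into host Python code.
float square(float x) { return x * x; }

__kernel void apply_square(__global const float *in,
                           __global float *out)
{
    int i = get_global_id(0);
    out[i] = square(in[i]);
}
```

In PyOpenCL this whole source string is passed to `cl.Program(ctx, src).build()`; anything the kernel calls must live inside that string (or in sources compiled and linked with it), never in the host Python process.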

  3. Maybe I can declare this extern function inside the kernel: is it possible by explicitly making this declaration inside it?

No.

