cProfile adds significant overhead when calling numba jit functions


Question

Compare a pure Python no-op function with a no-op function decorated with @numba.jit, that is:

import numba

@numba.njit
def boring_numba():
    pass

def call_numba(x):
    for t in range(x):
        boring_numba()

def boring_normal():
    pass

def call_normal(x):
    for t in range(x):
        boring_normal()

If we time this with %timeit, we get the following:

%timeit call_numba(int(1e7))
792 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit call_normal(int(1e7))
737 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

All perfectly reasonable; there's a small overhead for the numba function, but not much.

If however we use cProfile to profile this code, we get the following:

cProfile.run('call_numba(int(1e7)); call_normal(int(1e7))', sort='cumulative')

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     76/1    0.003    0.000    8.670    8.670 {built-in method builtins.exec}
        1    6.613    6.613    7.127    7.127 experiments.py:10(call_numba)
        1    1.111    1.111    1.543    1.543 experiments.py:17(call_normal)
 10000000    0.432    0.000    0.432    0.000 experiments.py:14(boring_normal)
 10000000    0.428    0.000    0.428    0.000 experiments.py:6(boring_numba)
        1    0.000    0.000    0.086    0.086 dispatcher.py:72(compile)

cProfile thinks there is a massive overhead in calling the numba function. This extends to "real" code: I had a function that simply called my expensive computation (the computation being numba-JIT-compiled), and cProfile reported that the wrapper function was taking around a third of the total time.

I don't mind cProfile adding a bit of overhead, but if it's massively inconsistent about where it adds that overhead it's not very helpful. Does anyone know why this happens, whether there is anything that can be done about it, and/or if there are any alternative profiling tools that don't interact badly with numba?

Answer

When you create a numba function you actually create a numba Dispatcher object. This object "re-directs" a "call" to boring_numba to the correct (as far as types are concerned) internal "jitted" function. So even though you created a function called boring_numba - this function isn't called, what is called is a compiled function based on your function.

So that you can see during profiling that the function boring_numba is called (even though it isn't; what is actually called is CPUDispatcher.__call__), the Dispatcher object needs to hook into the current thread state and check whether a profiler/tracer is running. If one is, it makes it look as if boring_numba were called. This last step is what incurs the overhead, because it has to fake a Python stack frame for boring_numba.

To get a bit more technical:

When you call the numba function boring_numba it actually calls Dispatcher_Call, which is a wrapper around call_cfunc, and here is the major difference: when a profiler is running, the code dealing with the profiler makes up the majority of the function call (just compare the if (tstate->use_tracing && tstate->c_profilefunc) branch with the else branch that runs if there is no profiler/tracer):

static PyObject *
call_cfunc(DispatcherObject *self, PyObject *cfunc, PyObject *args, PyObject *kws, PyObject *locals)
{
    PyCFunctionWithKeywords fn;
    PyThreadState *tstate;
    assert(PyCFunction_Check(cfunc));
    assert(PyCFunction_GET_FLAGS(cfunc) == METH_VARARGS | METH_KEYWORDS);
    fn = (PyCFunctionWithKeywords) PyCFunction_GET_FUNCTION(cfunc);
    tstate = PyThreadState_GET();
    if (tstate->use_tracing && tstate->c_profilefunc)
    {
        /*
         * The following code requires some explaining:
         *
         * We want the jit-compiled function to be visible to the profiler, so we
         * need to synthesize a frame for it.
         * The PyFrame_New() constructor doesn't do anything with the 'locals' value if the 'code's
         * 'CO_NEWLOCALS' flag is set (which is always the case nowadays).
         * So, to get local variables into the frame, we have to manually set the 'f_locals'
         * member, then call `PyFrame_LocalsToFast`, where a subsequent call to the `frame.f_locals`
         * property (by virtue of the `frame_getlocals` function in frameobject.c) will find them.
         */
        PyCodeObject *code = (PyCodeObject*)PyObject_GetAttrString((PyObject*)self, "__code__");
        PyObject *globals = PyDict_New();
        PyObject *builtins = PyEval_GetBuiltins();
        PyFrameObject *frame = NULL;
        PyObject *result = NULL;

        if (!code) {
            PyErr_Format(PyExc_RuntimeError, "No __code__ attribute found.");
            goto error;
        }
        /* Populate builtins, which is required by some JITted functions */
        if (PyDict_SetItemString(globals, "__builtins__", builtins)) {
            goto error;
        }
        frame = PyFrame_New(tstate, code, globals, NULL);
        if (frame == NULL) {
            goto error;
        }
        /* Populate the 'fast locals' in `frame` */
        Py_XDECREF(frame->f_locals);
        frame->f_locals = locals;
        Py_XINCREF(frame->f_locals);
        PyFrame_LocalsToFast(frame, 0);
        tstate->frame = frame;
        C_TRACE(result, fn(PyCFunction_GET_SELF(cfunc), args, kws));
        tstate->frame = frame->f_back;

    error:
        Py_XDECREF(frame);
        Py_XDECREF(globals);
        Py_XDECREF(code);
        return result;
    }
    else
        return fn(PyCFunction_GET_SELF(cfunc), args, kws);
}

I assume that this extra code (run only when a profiler is active) is what slows down the function when you're profiling with cProfile.
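The same mechanism can be demonstrated in pure Python: installing any profile callback via sys.setprofile sets the interpreter's c_profilefunc, and every function call then pays for the tracing branch, just as the C code above does for the Dispatcher. A sketch (the timings are illustrative, not from the question):

```python
import sys
import time

def call_many(n):
    def inner():
        pass
    t0 = time.perf_counter()
    for _ in range(n):
        inner()
    return time.perf_counter() - t0

events = []
def profiler(frame, event, arg):
    # A do-nothing callback: its mere presence makes the interpreter
    # take the "profiler is running" branch on every call/return.
    events.append(event)

plain = call_many(100_000)
sys.setprofile(profiler)       # sets tstate->c_profilefunc
traced = call_many(100_000)
sys.setprofile(None)

print(f"without profiler: {plain:.4f}s, with profiler: {traced:.4f}s")
```

The traced run is noticeably slower even though the callback does almost nothing, which is the same effect the Dispatcher's frame-faking code amplifies.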

It's a bit unfortunate that numba functions add so much overhead when you run a profiler, but the slowdown will be almost negligible if you do anything substantial inside the numba function, and even more so if you also move the for loop into a numba function.

If you notice that the numba function (with or without a profiler running) takes too much time, then you are probably calling it too often. In that case you should check whether you can move the loop inside the numba function, or wrap the code containing the loop in another numba function.

Note: All of this is (a bit of) speculation; I haven't actually built numba with debug symbols and profiled the C code while a profiler is running. However, the amount of work done when a profiler is running makes this seem very plausible. And all of this assumes numba 0.39; I'm not sure whether it applies to past versions as well.

