Remainder function (%) runtime on numpy arrays is far longer than manual remainder calculation


Question

In the past few days I've been working on improving the runtime of a python function which requires many uses of the remainder function (%) among other things. My main test case is over an 80,000 element numpy array (monotonically increasing), with 10000 iterations, although I've tried on various other sizes as well.

Eventually I reached a point where the remainder function is a major bottleneck, and tried various solutions. This is the behaviour I found when running the following code:

import numpy as np
import time

a = np.random.rand(80000)
a = np.cumsum(a)
d = 3
start_time1 = time.time()
for i in range(10000):
    b = a % d
    d += 0.001
end_time1 = time.time()
d = 3
start_time2 = time.time()
for i in range(10000):
    b = a - (d * np.floor(a / d))
    d += 0.001
end_time2 = time.time()
print((end_time1 - start_time1) / 10000)
print((end_time2 - start_time2) / 10000)

The output is:

0.0031344462633132934
0.00022937238216400147

When increasing the array size to 800,000:

0.014903099656105041
0.010498356819152833

(For this post I ran the code only once to produce the output shown; while trying to understand the issue I got these results consistently.)
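
For anyone reproducing this, a minimal timeit-based variant (with a fixed divisor rather than the incrementing d above) gives more stable per-call numbers:

import numpy as np
import timeit

a = np.cumsum(np.random.rand(80000))
d = 3.0
# timeit.repeat runs each measurement several times; taking the best smooths out
# the noise of a single time.time() measurement.
print(min(timeit.repeat('a % d', globals=globals(), number=1000)) / 1000)
print(min(timeit.repeat('a - d * np.floor(a / d)', globals=globals(), number=1000)) / 1000)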

While this solves my runtime problem, I have a hard time understanding why. Am I missing something? The only difference I can think of is the overhead of an additional function call, but the first case is pretty extreme (and 1.5x the runtime isn't good enough either), and if that were the case I would think the existence of the np.remainder function is pointless.
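
As a sanity check (a minimal sketch, using the same kind of input as above), the two expressions agree to floating-point rounding for positive operands:

import numpy as np

a = np.cumsum(np.random.rand(80000))  # positive, monotonically increasing, as in the test case
d = 3.0
# For positive operands both expressions compute the same mathematical remainder,
# so any difference is only floating-point rounding.
print(np.allclose(a % d, a - d * np.floor(a / d)))  # True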

I tried testing the same code with non-numpy loops:

import numpy as np
import time


def pythonic_remainder(array, d):
    b = np.zeros(len(array))
    for i in range(len(array)):
        b[i] = array[i] % d
    return b

def split_pythonic_remainder(array, d):
    b = np.zeros(len(array))
    for i in range(len(array)):
        b[i] = array[i] - (d * np.floor(array[i] / d))
    return b

def split_remainder(a, d):
    return a - (d * np.floor(a / d))

def divide(array, iterations, action):
    d = 3
    for i in range(iterations):
        b = action(array, d)
        d += 0.001

a = np.random.rand(80000)
a = np.cumsum(a)
start_time = time.time()
divide(a, 10000, split_remainder)
print((time.time() - start_time) / 10000)

start_time = time.time()
divide(a, 10000, np.remainder)
print((time.time() - start_time) / 10000)
start_time = time.time()
divide(a, 10000, pythonic_remainder)
print((time.time() - start_time) / 10000)

start_time = time.time()
divide(a, 10000, split_pythonic_remainder)
print((time.time() - start_time) / 10000)

The results I get are:

0.0003770533800125122
0.003932329940795899
0.018835473942756652
0.10940513386726379

I find it interesting that the opposite is true in the non-numpy case.

Answer

My best hypothesis is that your NumPy install is using an unoptimized fmod inside the % calculation. Here's why.

First, I can't reproduce your results on a normal pip-installed NumPy 1.15.1. I get only about a 10% performance difference (asdf.py contains your timing code):

$ python3.6 asdf.py
0.0006543657302856445
0.0006025806903839111

I can reproduce a major performance discrepancy with a manual build (python3.6 setup.py build_ext --inplace -j 4) of v1.15.1 from a clone of the NumPy Git repository, though:

$ python3.6 asdf.py
0.00242799973487854
0.0006397026300430298

This suggests that my pip-installed build's % is better optimized than my manual build or what you have installed.
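
For reference, a quick way to identify which NumPy build is actually in use (it won't reveal how the ufunc loops were compiled, but it does pin down the version and build configuration):

import numpy as np

print(np.__version__)  # e.g. 1.15.1
np.show_config()       # prints the compile/link configuration of this NumPy install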

Looking under the hood, it's tempting to look at the implementation of floating-point % in NumPy and blame the slowdown on the unnecessary floordiv calculation (npy_divmod@c@ calculates both // and %):

/* Templated inner loop from NumPy's umath source (loops.c.src);
 * @TYPE@, @type@ and @c@ are expanded once per floating-point dtype. */
NPY_NO_EXPORT void
@TYPE@_remainder(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    BINARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        const @type@ in2 = *(@type@ *)ip2;
        npy_divmod@c@(in1, in2, (@type@ *)op1);
    }
}

but in my experiments, removing the floordiv provided no benefit. It looks easy enough for a compiler to optimize out, so maybe it was optimized out, or maybe it was just a negligible fraction of the runtime in the first place.

Rather than the floordiv, let's focus on just one line in npy_divmod@c@, the fmod call:

mod = npy_fmod@c@(a, b);

This is the initial remainder computation, before special-case handling and adjusting the result to match the sign of the right-hand operand. If we compare the performance of % with numpy.fmod on my manual build:

>>> import timeit
>>> import numpy
>>> a = numpy.arange(1, 8000, dtype=float)
>>> timeit.timeit('a % 3', globals=globals(), number=1000)
0.3510419335216284
>>> timeit.timeit('numpy.fmod(a, 3)', globals=globals(), number=1000)
0.33593094255775213
>>> timeit.timeit('a - 3*numpy.floor(a/3)', globals=globals(), number=1000)
0.07980139832943678

We see that fmod appears to be responsible for almost the entire runtime of %.

I haven't disassembled the generated binary or stepped through it in an instruction-level debugger to see exactly what gets executed, and of course, I don't have access to your machine or your copy of NumPy. Still, from the above evidence, fmod seems like a pretty likely culprit.
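
As a side note, the sign adjustment mentioned above is the main user-visible difference between numpy.fmod and %: fmod keeps the sign of the dividend, while % follows the sign of the divisor. A small illustration (separate from the timing discussion):

import numpy as np

x = np.array([-7.0, 7.0])
print(np.fmod(x, 3.0))  # [-1.  1.] -- fmod: result takes the sign of the dividend
print(x % 3.0)          # [ 2.  1.] -- %:    result takes the sign of the divisor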
