NumPy version of rolling maximum in pandas

Question

TL;DR: How can I improve my function so that it outperforms pandas' own moving maximum function?

Background:

I work with a lot of moving averages, moving maxima, moving minima and so on, and the only moving-window feature I have found so far is the pandas rolling method. The thing is: my data are NumPy arrays, and the end result must also be a NumPy array. As much as I would like to simply convert the data to a pandas Series and back to a NumPy array, like this:

result2_max = pd.Series(data_array).rolling(window).max().to_numpy()

it feels unpythonic: converting data types seems unnecessary, and there may be a way to do exactly the same thing purely in NumPy.

However, as unpythonic as it may seem, it is faster than any approach I have come up with or seen online. Here is a small benchmark:

import numpy as np
import pandas as pd

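def numpy_rolling_max(data, window):
    # Build a (n - window + 1, window) view in which each row is one window.
    # The array is reversed before striding and the rows are reversed again
    # afterwards, so row k ends up holding data[k:k + window] in reverse order.
    data = data[::-1]
    data_strides = data.strides[0]

    movin_window = np.lib.stride_tricks.as_strided(
        data,
        shape=(data.shape[0] - window + 1, window),
        strides=(data_strides, data_strides),
    )[::-1]
    max_window = np.amax(movin_window, axis=1)  # this line seems to be the bottleneck

    # Pad the first window - 1 positions with NaN, as pandas does.
    nan_array = np.full(window - 1, np.nan)
    return np.hstack((nan_array, max_window))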


def pandas_rolling_max(data, window):
    return pd.Series(data).rolling(window).max().to_numpy()

length = 120000
window = 190
data = np.arange(length) + 0.5

result1_max = numpy_rolling_max(data, window)  # 21.9 ms per loop
result2_max = pandas_rolling_max(data, window)  # 5.43 ms per loop

result_comparison = np.allclose(result1_max, result2_max, equal_nan=True)

With an array size of 120k and a window of 190, the pandas rolling maximum is about 4 times faster than the NumPy version (5.43 ms versus 21.9 ms per loop). I have no clue how to proceed: I have already vectorized my function as much as I can, yet it is still much slower than the pandas version, and I don't really know why.

Thanks in advance.

I have found the bottleneck, and it is this line:

max_window = np.amax(movin_window, axis=1)

But since that is already a vectorized function call, I still have no clue how to proceed.

Recommended answer

We can use the 1D max filter from SciPy to replicate the behavior of the pandas version while being somewhat more efficient:

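import numpy as np
from scipy.ndimage import maximum_filter1d  # scipy.ndimage.filters is deprecated

def max_filter1d_same(a, W, fillna=np.nan):
    # dtype that an array holding fillna would have
    out_dtype = np.full(0, fillna).dtype
    hW = (W - 1) // 2  # Half window size
    # Shifting the origin by hW makes the window end at each element
    out = maximum_filter1d(a, size=W, origin=hW)
    if out.dtype == out_dtype:
        # fillna has the output dtype: overwrite the first W - 1 values in place
        out[:W - 1] = fillna
    else:
        # Otherwise let concatenate upcast (e.g. int to float for NaN)
        out = np.concatenate((np.full(W - 1, fillna), out[W - 1:]))
    return out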
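To see what the `origin` argument is doing, here is a small sketch on a toy array (the array and the brute-force check are made up for illustration and are not part of the benchmark). With the default `origin=0` the window is centered on each element. With `origin=(W - 1) // 2` it is shifted left so that it ends at each element, which is the pandas convention. The first `W - 1` outputs still come from SciPy's boundary handling (default mode `reflect`), which is why the function above overwrites them.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

a = np.array([1, 5, 2, 4, 3])  # toy data, chosen only for illustration
W = 3

# Default origin=0: the window is centered on each element.
centered = maximum_filter1d(a, size=W)

# origin=(W - 1) // 2 shifts the window left so that it ends at each element.
trailing = maximum_filter1d(a, size=W, origin=(W - 1) // 2)

# Brute-force trailing maximum for the positions that have a full window.
expected = np.array([a[i - W + 1:i + 1].max() for i in range(W - 1, a.size)])

print(centered)
print(trailing[W - 1:], expected)
```

Only `trailing[W - 1:]` is compared with the brute-force result here; the earlier positions depend on the boundary mode and are replaced by `fillna` anyway.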

Sample run:

In [161]: np.random.seed(0)
     ...: a = np.random.randint(0,999,(20))
     ...: window = 3

In [162]: a
Out[162]: 
array([684, 559, 629, 192, 835, 763, 707, 359,   9, 723, 277, 754, 804,
       599,  70, 472, 600, 396, 314, 705])

In [163]: pd.Series(a).rolling(window).max().to_numpy()
Out[163]: 
array([ nan,  nan, 684., 629., 835., 835., 835., 763., 707., 723., 723.,
       754., 804., 804., 804., 599., 600., 600., 600., 705.])

In [164]: max_filter1d_same(a,window)
Out[164]: 
array([ nan,  nan, 684., 629., 835., 835., 835., 763., 707., 723., 723.,
       754., 804., 804., 804., 599., 600., 600., 600., 705.])

# Use same dtype fillna for better memory efficiency
In [165]: max_filter1d_same(a,window,fillna=0)
Out[165]: 
array([  0,   0, 684, 629, 835, 835, 835, 763, 707, 723, 723, 754, 804,
       804, 804, 599, 600, 600, 600, 705])

Timings on the actual test-case sizes:

In [171]: # Actual test-cases sizes
     ...: np.random.seed(0)
     ...: data_array = np.random.randint(0,999,(120000))
     ...: window = 190

In [172]: %timeit pd.Series(data_array).rolling(window).max().to_numpy()
100 loops, best of 3: 4.43 ms per loop

In [173]: %timeit max_filter1d_same(data_array,window)
100 loops, best of 3: 1.95 ms per loop
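If a NumPy-only solution is required, one option is a block-wise prefix/suffix maximum (the van Herk / Gil-Werman idea). This is not from the answer above and is not benchmarked here; treat it as a minimal sketch. The strided approach in the question scans every window in full, so it does on the order of n × window comparisons, while this variant makes a fixed number of passes over the data. The function name `rolling_max_blocks` is hypothetical, and the sketch assumes a 1-D input without NaNs that can be converted to float.

```python
import numpy as np

def rolling_max_blocks(a, W):
    """Trailing rolling maximum, NaN-padded like pandas (hypothetical helper).

    Assumes a 1-D input without NaNs; the result is always float.
    """
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    if W < 1:
        raise ValueError("window must be at least 1")
    out = np.full(n, np.nan)
    if W > n:
        return out  # no position has a full window

    # Pad with -inf so that the length is a multiple of W, then cut into blocks.
    pad = (-n) % W
    blocks = np.concatenate((a, np.full(pad, -np.inf))).reshape(-1, W)

    # g[i]: max from the start of i's block up to i (prefix max per block).
    g = np.maximum.accumulate(blocks, axis=1).ravel()
    # h[j]: max from j to the end of j's block (suffix max per block).
    h = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()

    # A window [i - W + 1, i] spans at most two blocks, so its maximum is
    # max(h[i - W + 1], g[i]).
    out[W - 1:] = np.maximum(h[:n - W + 1], g[W - 1:n])
    return out

# Same shape of input as in the question.
data = np.arange(120000) + 0.5
result = rolling_max_blocks(data, 190)
```

Whether this beats `maximum_filter1d` or pandas on a given machine would need to be measured; the point is only that the amount of work no longer grows with the window size.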
