Rolling idxmin/max for pandas DataFrame

Question

I believe the following function is a working solution for pandas DataFrame rolling argmin/max:

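import numpy as np

def data_frame_rolling_arg_func(df, window_size, func):
    # func is 'min' or 'max'; it selects np.argmin or np.argmax.
    ws = window_size
    wm1 = window_size - 1
    # Window-relative positions are shifted to absolute row numbers, then mapped to labels.
    # Note: pandas 2.1+ renames DataFrame.applymap to DataFrame.map.
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))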

It is inspired by a partial solution for a rolling idxmax on a pandas Series.

Explanation:

  • Apply the numpy argmin/max function to the rolling window.
  • Only keep the non-NaN values.
  • Convert the values to int.
  • Realign the values to original row numbers.
  • Use applymap to replace the row numbers by the index values.
  • Combine with the original DataFrame filled with NaN in order to add the first rows with expected NaN values.
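
The steps above can be followed one at a time on a toy frame. This is only a sketch: the data and the column name `x` are made up for illustration, the label lookup uses a column-wise `Series.map` rather than `applymap`, and `reindex` stands in for the `combine_first` call of the original function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to make each step easy to follow.
df = pd.DataFrame({'x': [3, 1, 2, 5, 4]}, index=list('abcde'))
ws = 3

# Step 1: position of the maximum inside each window (0 .. ws-1);
# incomplete windows give NaN.
pos = df.rolling(ws).apply(np.argmax, raw=True)

# Steps 2 and 3: keep only the rows with a full window, then convert to int.
pos = pos[ws - 1:].astype(int)

# Step 4: shift window-relative positions to absolute row numbers.
rows = pos + np.arange(len(df) - (ws - 1))[:, None]

# Step 5: replace row numbers by index labels.
labels = rows.apply(lambda col: col.map(lambda r: df.index[r]))

# Step 6: re-add the first ws-1 rows, which become NaN.
result = labels.reindex(df.index)
print(result)
```

For this data the windows ending at c, d and e have their maximum at a, d and d respectively, and rows a and b stay NaN.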

In [1]: index = map(chr, range(ord('a'), ord('a') + 10))

In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)

In [3]: df
Out[3]: 
    0   1   2
a  -4  15   0
b   0  -6   4
c   7   8 -18
d  11  12 -16
e   6   3  -6
f  -1   4  -9
g   6 -10  -7
h   8  11 -25
i  -2 -10  -8
j   0  10  -7

In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j

In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h
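
One way to look for mistakes is to compare against a slow but obvious reference that calls `idxmin`/`idxmax` on every full window. The sketch below is not part of the original function; `rolling_idx_reference` is a hypothetical helper name, and the data is copied from `Out[3]` above so the result can be compared with `Out[4]` and `Out[5]`.

```python
import numpy as np
import pandas as pd


def rolling_idx_reference(df, window_size, func):
    """Slow reference: call Series.idxmin/idxmax on each full window."""
    ws = window_size
    labels = {}
    for col in df.columns:
        series = df[col]
        # The first ws-1 rows have no full window, so they are NaN.
        labels[col] = [np.nan] * (ws - 1) + [
            getattr(series.iloc[i - ws + 1:i + 1], f'idx{func}')()
            for i in range(ws - 1, len(df))
        ]
    return pd.DataFrame(labels, index=df.index)


# Data copied from Out[3].
df = pd.DataFrame({0: [-4, 0, 7, 11, 6, -1, 6, 8, -2, 0],
                   1: [15, -6, 8, 12, 3, 4, -10, 11, -10, 10],
                   2: [0, 4, -18, -16, -6, -9, -7, -25, -8, -7]},
                  index=list('abcdefghij'))
ref_max = rolling_idx_reference(df, 3, 'max')
ref_min = rolling_idx_reference(df, 3, 'min')
print(ref_max)
print(ref_min)
```

On this data the reference agrees with both outputs shown above, including the tie in column 0 at row g, where the first occurrence (e) is reported.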

My questions are:

  • Can you find any mistakes?
  • Is there a better solution? That is: more performant and/or more elegant.

And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.

Recommended answer

The solution in my previous answer manages to give proper index values for NaN input values, but I have realized that this is most probably not what a native pandas rolling idxmin/idxmax would do by default. By default, it would produce a NaN value if there is one or more NaN values in the window.
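
The default behaviour referred to here can be seen with the built-in rolling max. This is a minimal sketch on made-up data: with the default `min_periods`, a fixed-size window that contains a NaN produces NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single NaN at position 1.
s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# Windows of size 3; every window that overlaps the NaN gives NaN.
rolled = s.rolling(3).max()
print(rolled)
```

Only the last window (3.0, 4.0, 5.0) is free of NaN, so only the last entry has a value (5.0).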

I came up with a variant of my solution, which does that:

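import numpy as np
import pandas as pd


def transform_if_possible(func):
    # Wrap func so that values it rejects with ValueError (here: NaN) pass through unchanged.
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


int_if_possible = transform_if_possible(int)


def data_frame_rolling_idx_func(df, window_size, func):
    # func is 'min' or 'max'; it selects np.argmin or np.argmax.
    ws = window_size
    wm1 = window_size - 1

    # int(i) raises ValueError for NaN and keeps the lookup valid on pandas
    # versions that reject float positions.
    index_if_possible = transform_if_possible(lambda i: df.index[int(i)])

    # Note: pandas 2.1+ renames DataFrame.applymap to DataFrame.map.
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'), raw=True).applymap(int_if_possible) +
            np.array([np.arange(len(df)) - wm1]).T).applymap(index_if_possible)


def main():
    print(int_if_possible(1.2))
    print(int_if_possible(np.nan))
    index = list(map(chr, range(ord('a'), ord('a') + 10)))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # column 0 must be float to hold NaN
    df.iloc[3:6, 0] = np.nan     # direct assignment instead of chained df[0][3:6]
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()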

Output:

1
nan
      0   1   2
a  15.0  -2  13
b  -6.0  -4  -3
c -12.0  -7  -8
d   NaN   0  -4
e   NaN  -1 -11
f   NaN  -9  10
g  -1.0  24   1
h -15.0  14 -16
i   7.0  -4  14
j  -1.0   4  10
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    c    c
d  NaN    c    c
e  NaN    c    e
f  NaN    f    e
g  NaN    f    e
h  NaN    f    h
i    h    i    h
j    h    i    h
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    a    a
d  NaN    d    b
e  NaN    d    d
f  NaN    d    f
g  NaN    g    f
h  NaN    g    f
i    i    g    i
j    i    h    i

To achieve my goal, I am using two functions to transform values into integers, and row numbers into index values, respectively, which leave NaN unchanged. I construct these functions with the help of a common closure, transform_if_possible. In the second case, since the index transformation is dependent on the DataFrame, I construct the transformation function from a local lambda function.
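
The closure can be tried in isolation. In this sketch the index `labels` is hypothetical, and the lookup lambda calls `int()` before indexing, which is an assumption made here so that NaN fails with the `ValueError` the wrapper catches.

```python
import numpy as np
import pandas as pd


def transform_if_possible(func):
    """Wrap func so that inputs it rejects with ValueError are returned unchanged."""
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


labels = pd.Index(list('abc'))  # hypothetical index, for illustration only
int_if_possible = transform_if_possible(int)
# int() runs first, so NaN raises ValueError before any lookup happens.
label_if_possible = transform_if_possible(lambda i: labels[int(i)])

print(int_if_possible(1.2))       # 1
print(int_if_possible(np.nan))    # nan
print(label_if_possible(1.0))     # b
print(label_if_possible(np.nan))  # nan
```

Both wrappers share the same error handling, so the NaN case is written only once.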

Apart from these aspects, the solution is similar to my previous one, but since NaN is handled explicitly, I no longer need special handling of the first window_size - 1 rows, so the code is a little shorter.

A nice side effect of this solution is that the running time seems to be lower: a little over three times the running time of the corresponding rolling min/max, instead of five times.

All in all, a better solution I think.
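
If performance matters, one possible alternative is to avoid the per-window Python call altogether. The sketch below is not the method of this answer: it swaps `rolling().apply()` for NumPy's `sliding_window_view` (NumPy 1.20 or later), and `rolling_idx_vectorized` is a hypothetical name. It keeps the same convention that a window containing NaN yields NaN. No timing claim is made for it.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def rolling_idx_vectorized(df, window_size, func):
    """Rolling idxmin/idxmax computed on all windows at once (sketch)."""
    values = df.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    out = np.full((n_rows, n_cols), np.nan, dtype=object)
    if n_rows >= window_size:
        # Shape: (n_rows - window_size + 1, n_cols, window_size).
        windows = sliding_window_view(values, window_size, axis=0)
        pos = getattr(np, f'arg{func}')(windows, axis=-1)
        # Shift window-relative positions to absolute row numbers.
        rows = pos + np.arange(len(pos))[:, None]
        labels = df.index.to_numpy()[rows].astype(object)
        # Any NaN inside a window yields NaN for that window.
        labels[np.isnan(windows).any(axis=-1)] = np.nan
        out[window_size - 1:] = labels
    return pd.DataFrame(out, index=df.index, columns=df.columns)


# Data copied from the output above, including the NaN block in column 0.
df = pd.DataFrame({0: [15, -6, -12, np.nan, np.nan, np.nan, -1, -15, 7, -1],
                   1: [-2, -4, -7, 0, -1, -9, 24, 14, -4, 4],
                   2: [13, -3, -8, -4, -11, 10, 1, -16, 14, 10]},
                  index=list('abcdefghij'))
res_min = rolling_idx_vectorized(df, 3, 'min')
print(res_min)
```

On this data the result matches the rolling idxmin output printed above.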
