Rolling idxmin/max for pandas DataFrame
Question
I believe the following function is a working solution for pandas DataFrame rolling argmin/max:
import numpy as np


def data_frame_rolling_arg_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))
It is inspired by a partial solution for rolling idxmax on a pandas Series.
Explanation:
- Apply the numpy argmin/max function to the rolling window.
- Only keep the non-NaN values.
- Convert the values to int.
- Realign the values to original row numbers.
- Use applymap to replace the row numbers by the index values.
- Combine with the original DataFrame filled with NaN in order to add the first rows with expected NaN values.
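The steps above can be traced one at a time on a single column. The following is a minimal sketch, not the function from the question: the toy Series, the variable names, and the use of reindex for the last step (instead of combine_first) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the result is easy to check by hand.
s = pd.Series([3.0, 1.0, 2.0, 5.0, 4.0], index=list("abcde"))
ws = 3

# Step 1: position of the maximum *inside* each window (0 .. ws-1).
# The first ws-1 windows are incomplete, so they come out as NaN.
rel = s.rolling(ws).apply(np.argmax, raw=True)

# Steps 2-3: keep the complete windows only and convert to int.
rel = rel.iloc[ws - 1:].astype(int)

# Step 4: the k-th complete window starts at row k, so adding k turns
# the window-relative position into an absolute row number.
absolute = rel.to_numpy() + np.arange(len(s) - ws + 1)

# Step 5: translate row numbers into index labels.
labels = pd.Series(s.index[absolute], index=rel.index)

# Step 6: reindex to the original index so the first rows are NaN again.
result = labels.reindex(s.index)
print(result)
```

Rows a and b come out as NaN, and rows c, d, e get the labels a, d, d: the window ending at c has its maximum (3.0) at a, while the windows ending at d and e both have their maximum (5.0) at d.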
In [1]: index = map(chr, range(ord('a'), ord('a') + 10))
In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
In [3]: df
Out[3]:
    0    1    2
a  -4   15    0
b   0   -6    4
c   7    8  -18
d  11   12  -16
e   6    3   -6
f  -1    4   -9
g   6  -10   -7
h   8   11  -25
i  -2  -10   -8
j   0   10   -7
In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]:
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j
In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]:
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h
My questions are:
- Can you find any mistakes?
- Is there a better solution? That is: more performant and/or more elegant.
And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.
Recommended answer
The solution in my previous answer manages to give proper index values for NaN input values, but I have realized that this is most probably not what a native pandas rolling idxmin/idxmax would do by default. By default, it would produce a NaN value if there are one or more NaN values in the window.
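The existing rolling aggregations show the behaviour being matched here. A small check, assuming a toy Series and the default min_periods (which equals the window size for a fixed-size window):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single NaN in second position.
s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# Every window of 3 that touches the NaN has fewer than 3 valid
# observations, so its result is NaN.
rolled = s.rolling(3).min()
print(rolled)
```

Only the last window (3.0, 4.0, 5.0) is free of NaN, so only the last row holds a value (3.0); the other four rows are NaN.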
I came up with a variant of my solution, which does that:
import numpy as np
import pandas as pd


def transform_if_possible(func):
    """Wrap func so that a ValueError returns the input unchanged."""
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


int_if_possible = transform_if_possible(int)


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    # int(i) raises ValueError for NaN, so NaN passes through unchanged;
    # it also avoids indexing df.index with a float position.
    index_if_possible = transform_if_possible(lambda i: df.index[int(i)])
    # Note: DataFrame.applymap is named DataFrame.map from pandas 2.1 on.
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'), raw=True).applymap(int_if_possible) +
            np.array([np.arange(len(df)) - wm1]).T).applymap(index_if_possible)


def main():
    print(int_if_possible(1.2))
    print(int_if_possible(np.nan))
    index = list(map(chr, range(ord('a'), ord('a') + 10)))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # make room for NaN in column 0
    df.iloc[3:6, 0] = np.nan
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()
Results:
1
nan
       0   1    2
a   15.0  -2   13
b   -6.0  -4   -3
c  -12.0  -7   -8
d    NaN   0   -4
e    NaN  -1  -11
f    NaN  -9   10
g   -1.0  24    1
h  -15.0  14  -16
i    7.0  -4   14
j   -1.0   4   10
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    c    c
d  NaN    c    c
e  NaN    c    e
f  NaN    f    e
g  NaN    f    e
h  NaN    f    h
i    h    i    h
j    h    i    h
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    a    a
d  NaN    d    b
e  NaN    d    d
f  NaN    d    f
g  NaN    g    f
h  NaN    g    f
i    i    g    i
j    i    h    i
To achieve my goal, I am using two functions to transform values into integers, and row numbers into index values, respectively, which leave NaN unchanged. I construct these functions with the help of a common closure, transform_if_possible. In the second case, since the index transformation is dependent on the DataFrame, I construct the transformation function from a local lambda function.
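The pass-through behaviour of the closure can be seen in isolation, without any DataFrame. In this sketch the closure is restated so the block runs on its own, and the labels list and label_if_possible are hypothetical stand-ins for df.index and the index transformation:

```python
def transform_if_possible(func):
    """Wrap func so that a ValueError returns the input unchanged."""
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


int_if_possible = transform_if_possible(int)

labels = ["a", "b", "c"]  # hypothetical stand-in for df.index
label_if_possible = transform_if_possible(lambda i: labels[int(i)])

nan = float("nan")
print(int_if_possible(2.0))    # 2
print(int_if_possible(nan))    # nan: int(nan) raises ValueError
print(label_if_possible(1.0))  # b
print(label_if_possible(nan))  # nan: passed through unchanged
```

Catching only ValueError keeps the wrapper narrow: other failures, such as an out-of-range position raising IndexError, still surface instead of being silently returned as the input.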
Apart from these aspects, the solution is similar to my previous one, but since NaN is explicitly handled, I no longer need a special handling of the first window_size - 1 rows, so the code is a little shorter.
A nice side effect of this solution is that the running time seems to be lower: a little over three times the running time of the corresponding rolling min/max, instead of five times.
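A comparison of this kind can be made with timeit. The sketch below is one way to set it up, under several assumptions: rolling_idx is a compact restatement of the approach (not the exact code above), the frame size and repeat count are arbitrary, and the ratio it prints will vary with data size, window size, and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# DataFrame.applymap was renamed DataFrame.map in pandas 2.1;
# use whichever one this pandas version provides.
_elementwise = getattr(pd.DataFrame, "map", None) or pd.DataFrame.applymap


def rolling_idx(df, ws, func):
    """Compact restatement of the rolling idxmin/idxmax approach (sketch)."""
    def label(i):
        try:
            return df.index[int(i)]  # int(nan) raises ValueError
        except ValueError:
            return i

    # Window-relative position of the extremum, NaN for unusable windows.
    pos = df.rolling(ws).apply(getattr(np, f"arg{func}"), raw=True)
    # Shift by the first row of each window to get absolute row numbers.
    pos = pos + (np.arange(len(df)) - (ws - 1))[:, None]
    return _elementwise(pos, label)


# Hypothetical benchmark data with string labels.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame(rng.standard_normal((n, 3)),
                  index=[f"r{i}" for i in range(n)])

t_min = timeit.timeit(lambda: df.rolling(3).min(), number=5)
t_idx = timeit.timeit(lambda: rolling_idx(df, 3, "min"), number=5)
print(f"rolling idxmin takes {t_idx / t_min:.1f}x the time of rolling min")
```

The rolling min here runs in compiled code while the apply-based version calls a Python function per window, so the measured ratio may differ considerably from the figures quoted above.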
All in all, a better solution I think.