Pandas: How to select a column in a rolling window


Question

I have a dataframe (with columns 'a', 'b', 'c') on which I am computing a rolling window.

I want to filter the rolling window on one of the columns (say 'a') inside the apply function, like below:

df.rolling(len(s),min_periods=0).apply(lambda x: x[[x['a']>10][0] if len(x[[x['a']>10]]) >=0 else np.nan)

The intention of the line above is to select the first row in the rolling window whose 'a' column has a value greater than 10. If there is no such row, return nan.

But I am unable to do so and get the following error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

This means that I am not allowed to access individual columns with this syntax at all. Is there any other way of doing this kind of thing?

Recommended answer

Your error stems from assuming that what is passed to the function inside apply is a dataframe; it is actually an ndarray, not a dataframe.

Pandas dataframe apply works on each column/series of the dataframe, so any function passed to apply is applied along each column/series, like an internal lambda. In the case of a windowed dataframe, apply takes each column/series inside each window and passes it to the function as an ndarray, and the function must return a single value per series per window. Knowing this saves a lot of pain.
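To see this behaviour directly, here is a minimal sketch that records what the function receives. The frame and the helper name `probe` are made up for illustration. It passes `raw=True`, which asks pandas for plain ndarrays; as far as I know, recent pandas versions hand over a Series for each window when `raw` is left at its default, but still one column at a time.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not the OP's)
df = pd.DataFrame({"a": [13, 17, 0, 0, 19], "b": [2, 19, 17, 14, 14]})

seen = []  # (type name, number of dimensions) of every argument passed in


def probe(window):
    # Each call gets the values of ONE column inside ONE window
    seen.append((type(window).__name__, window.ndim))
    return window.sum()  # a single scalar per column per window


# raw=True requests ndarrays instead of Series
result = df.rolling(3).apply(probe, raw=True)

print(set(seen))   # every argument was a 1-D ndarray
print(result)
```

Because the function never sees the other columns, an expression such as `x['a']` inside it has nothing to select from, which is why the OP's lambda fails.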

So in your case you cannot use apply, unless you have a complex function that remembers the first value of series a for each window.

For OP's case, suppose a column of the window, say a, must meet a condition, say > 10:

  1. For the case where a in the first row of a window meets the condition, it is the same as searching the dataframe with df[df['a']>10].

For other conditions, such as a in the second row of a window being > 10, checking the entire dataframe works except for the first window of the dataframe.
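The idea of checking the entire dataframe can be sketched without any rolling apply. This is my own reading of the paragraph above, not code from the answer: a row can be row number `search_index` of a complete window only if enough rows exist before and after it, so the sketch restricts the search to that range and then filters on the condition. The values of column a are copied from the sample frame shown in this answer; the names `first`, `last` and `candidates` are illustrative.

```python
import pandas as pd

# Column 'a' copied from the answer's sample frame
a = pd.Series([13, 17, 0, 0, 19, 16, 2, 16, 6, 12,
               5, 10, 15, 13, 14, 1, 17, 1, 16, 11], name="a")

roll_window = 5
search_index = 1  # 0 = first row of a window, 1 = second row, ...

# A row at position p is row `search_index` of a complete window only if
# search_index <= p <= len(a) - roll_window + search_index
first = search_index
last = len(a) - roll_window + search_index

candidates = a.iloc[first:last + 1]
matches = candidates[candidates > 10]

print(matches.index.tolist())
```

With `search_index = 1` this skips row 0, which is never the second row of any window, and that is the "first window" exception mentioned above.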

The following example demonstrates another way to a solution.

import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)), columns=list('abcd'))

df looks like

    a   b   c   d
0   13  2   2   6
1   17  19  10  1
2   0   17  15  9
3   0   14  0   15
4   19  14  4   0
5   16  4   17  3
6   2   7   2   15
7   16  7   9   3
8   6   1   2   1
9   12  8   3   10
10  5   0   11  2
11  10  13  18  4
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
15  1   11  5   18
16  17  12  18  17
17  1   19  12  9
18  16  17  3   3
19  11  7   9   2

Now select a window if the second row inside the rolling window of a meets the condition a > 10, as in OP's question.

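roll_window = 5
search_index = 1  # position inside each window: 0 = first row, 1 = second row

df_roll = df['a'].rolling(roll_window)
# raw=True passes each window as an ndarray, so x[search_index] is positional
df_y = df_roll.apply(lambda x: x[search_index] if x[search_index] > 10 else np.nan, raw=True).dropna()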

The line above returns all values of a for which a in the second row of a window is greater than 10. Note that the values are right for the example dataframe above, but each index is that of the last row of its window, because rolling labels a window by its right edge unless center=True.

4     17.0
7     19.0
8     16.0
10    16.0
12    12.0
15    15.0
16    13.0
17    14.0
19    17.0

To get the right index location and the entire row in the first dataframe:

df.loc[df_y.index + search_index - roll_window + 1]

returns

    a   b   c   d
1   17  19  10  1
4   19  14  4   0
5   16  4   17  3
7   16  7   9   3
9   12  8   3   10
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
16  17  12  18  17

One could also use np.array(df), make a rolling slice corresponding to the rolling window, and filter the array using those slices.
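A rough sketch of that array-based idea, applied to the OP's original goal (the first row in each window with a > 10, else nan). It assumes NumPy 1.20 or later for `numpy.lib.stride_tricks.sliding_window_view`; the small frame and the names `a_win`, `first_pos` and `rows` are invented for the illustration, and only complete windows are considered.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical data; the window ending at row 3 has no a > 10
df = pd.DataFrame({
    "a": [13, 0, 0, 2, 19, 16, 2, 1],
    "b": [2, 19, 17, 14, 14, 4, 7, 7],
})
roll_window = 3

# One row per complete window, holding that window's values of column a
a_win = sliding_window_view(df["a"].to_numpy(), roll_window)

mask = a_win > 10
has_hit = mask.any(axis=1)        # does the window contain a match at all?
first_pos = mask.argmax(axis=1)   # offset of the first match inside the window

# Window i starts at row i, so the matching row sits at i + first_pos
rows = np.arange(len(a_win)) + first_pos

out = df.iloc[rows].to_numpy(dtype=float)
out[~has_hit] = np.nan            # windows without a match become nan

# Label each window by its last row, as rolling does by default
result = pd.DataFrame(out, index=df.index[roll_window - 1:], columns=df.columns)
print(result)
```

Unlike rolling apply, this keeps all columns of the matching row together, at the cost of handling partial windows (min_periods) yourself.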
