Forward fill column with an index-based limit
Question
I want to forward fill a column and specify a limit, but I want the limit to be based on the index rather than on a simple number of rows, which is all the limit argument allows.
For example, say I have the dataframe given by:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})
which looks like:
In [27]: df
Out[27]:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
3   3.0      1
4   NaN      1
5   5.0      0
6   NaN      0
7   NaN      0
8   NaN      1
9   NaN      1
If I group by the group column and forward fill within each group with limit=2, the resulting dataframe is:
In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   3.0
9      1   NaN
What I actually want, however, is to forward fill only onto rows whose index is within, say, 2 of the index of the last non-NaN value in the group, as opposed to the next 2 rows of each group. For example, if we just look at the groups of the dataframe:
In [36]: for i, group in df.groupby('group'):
    ...:     print(group)
    ...:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
5   5.0      0
6   NaN      0
7   NaN      0
   data  group
3   3.0      1
4   NaN      1
8   NaN      1
9   NaN      1
I would want the second group here to be forward filled only up to index 4, not onto indexes 8 and 9. The first group's NaN values are all within 2 index positions of the last non-NaN value, so they would be filled completely. The resulting dataframe would look like:
   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   NaN
9      1   NaN
FWIW, in my actual use case the index is a DatetimeIndex (and it is sorted).
I currently have a solution that sort of works: it loops through the dataframe filtered on the group indexes, creates a time range from the index for every event with a non-NaN value, and then combines those ranges. But this is far too slow to be practical.
Recommended answer
import numpy as np
import pandas as pd
df = pd.DataFrame({
'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})
df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
#    index  data  group  stop_index   mask
# 0      0   0.0      0         2.0   True
# 1      1   1.0      0         3.0   True
# 2      2   1.0      1         4.0   True
# 3      3   3.0      0         5.0   True
# 4      4   1.0      1         4.0   True
# 5      5  22.0      0         7.0   True
# 6      6   NaN      1         4.0  False
# 7      7   5.0      0         9.0   True
# 8      8   NaN      1         4.0  False
# 9      9   NaN      1         4.0  False
# clean up df
df = df[['data', 'group']]
print(df)
yields
   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   1.0      1
5  22.0      0
6   NaN      1
7   5.0      0
8   NaN      1
9   NaN      1
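The code above uses its own sample data, so as a check the same idea can be wrapped in a small helper and run against the dataframe from the question. This is only a sketch: the function name ffill_within_index_limit and its parameters are made up for illustration, and it assumes the index is sorted. Instead of a stop_index column it compares each row's index with the index of the group's last non-null value, which is an equivalent formulation.

```python
import numpy as np
import pandas as pd


def ffill_within_index_limit(df, value_col, group_col, limit):
    """Forward fill value_col per group, but only onto rows whose index is
    within `limit` of the index of the group's last non-null value.
    (Hypothetical helper, not part of pandas.)"""
    idx = pd.Series(df.index, index=df.index)
    # Index of the most recent non-null observation within each group.
    last_valid = idx.where(df[value_col].notna()).groupby(df[group_col]).ffill()
    # Unlimited per-group forward fill ...
    filled = df.groupby(group_col)[value_col].ffill()
    # ... kept only where the row is close enough to its source row.
    return filled.where(idx - last_valid <= limit)


# The dataframe from the question.
df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})

result = ffill_within_index_limit(df, 'data', 'group', limit=2)
print(result.tolist())
# [0.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0, nan, nan]
```

Rows 8 and 9 stay NaN because they are 5 and 6 index positions away from index 3, the last non-null row in group 1, which matches the result the question asks for.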
This copies the index into a column, then makes a second column, stop_index, which is the index augmented by the size of the (time) window.
df = df.reset_index()
df['stop_index'] = df['index'] + 2
Then it makes null rows in stop_index to match the null rows in data:
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
Then it forward-fills stop_index on a per-group basis:
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
Now (at last) we can define the desired mask, that is, the places where we actually want to forward-fill data:
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
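Since the question mentions a sorted DatetimeIndex, here is a minimal sketch of the same steps with timestamps, where the window is a pd.Timedelta instead of an integer. The sample timestamps and the two-day window are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: irregular timestamps, two interleaved groups.
idx = pd.to_datetime([
    "2024-01-01", "2024-01-02", "2024-01-03",
    "2024-01-04", "2024-01-05", "2024-01-10",
])
df = pd.DataFrame(
    {"data": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan],
     "group": [0, 1, 0, 1, 0, 1]},
    index=idx,
)
window = pd.Timedelta(days=2)  # assumed window size

# The index as a Series, so it can be shifted, masked, and grouped.
ts = df.index.to_series()
# Stop time for each non-null observation; NaT wherever data is null.
stop = (ts + window).where(df["data"].notna())
# Carry each group's latest stop time forward.
stop = stop.groupby(df["group"]).ffill()
# Fill only rows at or before their group's current stop time.
mask = ts <= stop
filled = df.groupby("group")["data"].ffill().where(mask)
print(filled.tolist())
# [1.0, 2.0, 1.0, 2.0, nan, nan]
```

Here 2024-01-03 and 2024-01-04 fall inside the two-day window of their group's last observation and are filled, while 2024-01-05 and 2024-01-10 fall outside it and stay NaN.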