Forward fill column with an index-based limit

Views: 50 | Published: 2021/6/13 20:46:53 | Tags: python pandas dataframe pandas-groupby imputation

Question

I want to forward fill a column with a limit, but I want the limit to be based on the index rather than on a simple number of rows, which is all that limit allows.

For example, say I have the dataframe given by:

df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})

which looks like

In [27]: df
Out[27]:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
3   3.0      1
4   NaN      1
5   5.0      0
6   NaN      0
7   NaN      0
8   NaN      1
9   NaN      1

If I group by the group column and forward fill within each group with limit=2, the resulting dataframe is

In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   3.0
9      1   NaN

What I actually want, however, is to forward fill only onto rows whose index is within, say, 2 of the index of the last non-NaN value in the same group, rather than onto the next 2 rows of each group. For example, if we just look at the groups of the dataframe:

In [36]: for i, group in df.groupby('group'):
    ...:     print(group)
    ...:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
5   5.0      0
6   NaN      0
7   NaN      0
   data  group
3   3.0      1
4   NaN      1
8   NaN      1
9   NaN      1

I would want the second group here to be forward filled only to index 4, not to 8 and 9. The first group's NaN values are all within 2 index positions of the last non-NaN value, so they would be filled completely. The resulting dataframe would look like:

   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   NaN
9      1   NaN

FWIW, in my actual use case the index is a DatetimeIndex (and it is sorted).

I currently have a solution that sort of works: it loops through the dataframe filtered on each group's index, creates a time range for every non-NaN value based on the index, and then combines those ranges. But this is far too slow to be practical.

Recommended answer

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
    'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})

df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
#    index  data  group  stop_index   mask
# 0      0   0.0      0         2.0   True
# 1      1   1.0      0         3.0   True
# 2      2   1.0      1         4.0   True
# 3      3   3.0      0         5.0   True
# 4      4   1.0      1         4.0   True
# 5      5  22.0      0         7.0   True
# 6      6   NaN      1         4.0  False
# 7      7   5.0      0         9.0   True
# 8      8   NaN      1         4.0  False
# 9      9   NaN      1         4.0  False

# clean up df
df = df[['data', 'group']]
print(df)

yields

   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   1.0      1
5  22.0      0
6   NaN      1
7   5.0      0
8   NaN      1
9   NaN      1
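The sample frame used in this answer differs from the one in the question. As a check, here is a minimal sketch that applies the same steps to the question's frame, using intermediate Series instead of helper columns. The names pos, stop, filled, and window are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})

window = 2  # maximum index distance a value may be carried forward

# The index as a Series, so it can be masked and forward filled.
pos = df.index.to_series()
# Last index a fill may reach, defined only on rows that hold a value.
stop = (pos + window).where(df['data'].notna())
# Carry each stop forward within its group.
stop = stop.groupby(df['group']).ffill()
# Unlimited group-wise forward fill, kept only where the row is inside the window.
filled = df.groupby('group')['data'].ffill().where(pos <= stop)

result = df.assign(data=filled)
print(result['data'].tolist())
# [0.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0, nan, nan]
```

For this frame the sketch reproduces the result the question asks for: index 4 is filled from index 3, while indexes 8 and 9 stay NaN.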

This copies the index into a column, then makes a second column, stop_index, which is the index plus the size of the (time) window:

df = df.reset_index()
df['stop_index'] = df['index'] + 2

Then it nulls out the rows of stop_index where data is null:

df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
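As a small illustration of this step, Series.where keeps the values where the condition is True and replaces the rest with NaN. The toy Series below are made up for the example and are not part of the answer's data.

```python
import numpy as np
import pandas as pd

# Toy inputs (hypothetical, for illustration only).
data = pd.Series([1.0, np.nan, 3.0])
stop_index = pd.Series([2, 3, 4])

# Keep stop_index only where data holds a value; other rows become NaN.
masked = stop_index.where(data.notna())
print(masked.tolist())  # [2.0, nan, 4.0]
```

The integer column becomes float because NaN has to be representable, which is why stop_index prints as 2.0, 3.0, and so on in the output above.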

Then it forward fills stop_index on a per-group basis:

df['stop_index'] = df.groupby('group')['stop_index'].ffill()

Now (at last) we can define the desired mask, that is, the places where we actually want to forward fill data:

df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
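The question mentions a sorted DatetimeIndex. The same idea should carry over if the window is a pd.Timedelta instead of an integer. The sketch below wraps the steps in a helper; ffill_within and its parameter names are hypothetical, and it assumes a sorted index that supports index + window. It keeps the original index rather than resetting it.

```python
import numpy as np
import pandas as pd


def ffill_within(df, value_col, group_col, window):
    """Group-wise forward fill reaching at most `window` past the last value.

    Hypothetical helper: assumes a sorted index that supports `index + window`
    (a DatetimeIndex with a Timedelta, or an integer index with an int).
    """
    pos = df.index.to_series()
    # Latest index label a fill may reach, set only on rows that hold a value.
    stop = (pos + window).where(df[value_col].notna())
    # Propagate each stop forward within its group.
    stop = stop.groupby(df[group_col]).ffill()
    # Comparisons against NaT/NaN are False, so rows with no prior value stay empty.
    filled = df.groupby(group_col)[value_col].ffill().where(pos <= stop)
    return df.assign(**{value_col: filled})


# The question's data on a daily DatetimeIndex (the dates are made up).
idx = pd.date_range('2021-01-01', periods=10, freq='D')
df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
}, index=idx)

out = ffill_within(df, 'data', 'group', pd.Timedelta(days=2))
print(out)
```

With a two-day window, the last two rows of group 1 fall five and six days after the last observed value, so they are left as NaN.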
