pandas groupby ffill bfill needs intermediate groupby?


Question

I'm trying to paper over missing data in a dataframe by grouping on one column and then flood-filling (bfill().ffill()) subsets of columns inside the groups.

I was previously using

def ffbf(x):
   return x.ffill().bfill()

df[some_cols] = df.groupby(group_key)[some_cols].transform(ffbf)

but transform becomes unbelievably slow even on relatively small dataframes (already several seconds for only 3000x20), so I wanted to see if I could apply ffill and bfill directly to the groups since they're supposed to be cythonized now.

Am I correct in thinking that I need to invoke groupby again in between ffill and bfill because neither method preserves the groupings?

Right now I have

df[some_cols] = df[some_cols].groupby(group_key).ffill().groupby(group_key).bfill()

and I think that it's doing what I want, and it's waaaaaaayyy faster than using transform, but I'm not experienced enough with pandas to be certain, so I figured I'd ask.

[edit] It looks like this change is jumbling my data. Why?

Recommended answer

In my opinion, a second groupby before bfill is necessary here. Otherwise the NaNs of a group that contains only NaNs would be filled with values from another group.
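The point above can be illustrated with a small sketch. The toy frame and the names `leaky` and `safe` are made up for illustration, and the code assumes a recent pandas version. Group `b` contains only NaNs, so a plain `bfill` after the grouped `ffill` pulls a value in from group `c`, while a second groupby keeps the fill inside each group:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "b" has no valid values at all.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "c", "c"],
    "A": [1.0, np.nan, np.nan, np.nan, np.nan, 3.0],
})

# Grouped ffill followed by an ungrouped bfill:
# the bfill runs over the whole column and crosses group boundaries.
leaky = df.groupby("key")["A"].ffill().bfill()

# Grouped ffill followed by a grouped bfill:
# the intermediate result is grouped again by the original key column.
safe = df.groupby("key")["A"].ffill()
safe = safe.groupby(df["key"]).bfill()

print(pd.DataFrame({"key": df["key"], "leaky": leaky, "safe": safe}))
```

In the `leaky` column the rows of group `b` receive the value 3.0 from group `c`; in the `safe` column they stay NaN.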

Performance was compared with this code:

In [205]: %timeit df1[some_cols] = df1.groupby(group_key)[some_cols].transform(ffbf)
443 ms ± 7.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [206]: %timeit df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(group_key).ffill().groupby(group_key).bfill()
5.69 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


import numpy as np
import pandas as pd

np.random.seed(785)

N = 10000
df = pd.DataFrame({'key':np.random.randint(1000, size=N),
                   'A':np.random.choice([1,2,np.nan], size=N),
                   'B':np.random.choice([1,4,np.nan], size=N),
                   'C':np.random.choice([7,0,np.nan], size=N),
                   'D':np.random.choice([7,0,8], size=N)})

df = df.sort_values('key')
print(df)

def ffbf(x):
    return x.ffill().bfill()

group_key = 'key'
some_cols = ['A','B','C']

# reference result: the slow transform-based version
df1 = df.copy()
df1[some_cols] = df1.groupby(group_key)[some_cols].transform(ffbf)

# slightly changed solution so that it works in pandas 0.23.1
df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(group_key).ffill().groupby(group_key).bfill()

print(df.equals(df1))
# True

In later pandas versions (tested with pandas 1.1.1), it is possible to use:

df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(df[group_key]).ffill().groupby(df[group_key]).bfill()
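As a rough cross-check, the two-step fill can also be written by selecting only the columns to fill and grouping the intermediate result by the original key Series. This is a sketch rather than part of the answer: it assumes a recent pandas version, uses a smaller random frame than the one above, and the names `expected` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(785)
N = 1000
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=N),
    "A": rng.choice([1, 2, np.nan], size=N),
    "B": rng.choice([1, 4, np.nan], size=N),
    "C": rng.choice([7, 0, np.nan], size=N),
})

group_key = "key"
some_cols = ["A", "B", "C"]

# Reference result: the slow transform-based version from the question.
expected = df.groupby(group_key)[some_cols].transform(lambda x: x.ffill().bfill())

# Two grouped fills; the second groupby uses the key Series because
# the intermediate frame holds only the filled columns.
filled = df.groupby(group_key)[some_cols].ffill()
filled = filled.groupby(df[group_key]).bfill()

print(filled.equals(expected))
df[some_cols] = filled
```

Because both results keep the original index, the filled columns can be assigned back to `df` without sorting by the key first.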
