Fillna (forward fill) on a large dataframe efficiently with groupby?

Problem description

What is the most efficient way to forward fill information in a large dataframe?

I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.

Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?

id       start_date   end_date    is_current  location  dimensions...
xyz987   2016-03-11   2016-04-02  Expired       CA      lots_of_stuff
xyz987   2016-04-03   2016-04-21  Expired       NaN     lots_of_stuff
xyz987   2016-04-22          NaN  Current       CA      lots_of_stuff

That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on one row but blank on the next row for the same ID. I know that the location has not changed, but because it is blank, the row is being captured as a unique row.

I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?

cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)

There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a

df.fillna(method='ffill', inplace=True)

but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).

Recommended answer

How about forward filling each group?

df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
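
As a possible alternative to the apply/lambda form, pandas also provides a built-in grouped forward fill, GroupBy.ffill(), which avoids calling a Python function once per group and so may be noticeably faster with about 75,000 groups. The sketch below is not from the original answer. It uses made-up toy data shaped like the example in the question (the second id, zzz000, is hypothetical) and assumes the rows should be ordered by start_date within each id before filling. It also assumes you may want to leave out columns that can legitimately be null, such as end_date on a Current row.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question's example; "zzz000" is a made-up id.
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987", "zzz000", "zzz000"],
    "start_date": ["2016-03-11", "2016-04-03", "2016-04-22",
                   "2016-01-01", "2016-02-01"],
    "location": ["CA", np.nan, "CA", np.nan, "NY"],
})

# Sort so that "previous row" means the previous record in time for each id.
df = df.sort_values(["id", "start_date"])

# Columns to fill: everything except the grouping key. Drop any column
# from this list that is allowed to be null (for example end_date).
value_cols = [c for c in df.columns if c != "id"]

# GroupBy.ffill fills within each id only and keeps the original index,
# so the result can be assigned straight back onto the same columns.
df[value_cols] = df.groupby("id")[value_cols].ffill()

print(df)
```

Here the gap in the middle xyz987 row is filled with CA, while the first zzz000 row stays null because there is no earlier value for that id, so nothing leaks across ids. There is no need for inplace=True: the grouped result is a new object, which is why it is assigned back. In recent pandas versions the method= argument of fillna is deprecated in favor of ffill().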
