pandas fill forward performance issue

Problem description

I have a dataframe with a MultiIndex (Date, InputTime), and its columns (Value, Id) may contain NA values. I want to fill values forward, but only within each Date, and I can't find an efficient way to do this.

Here is the type of dataframe I have:

Here is the result I want:

So, to fill forward correctly by date, I can use groupby(level=0). The groupby itself is fast, but the fill forward applied to the dataframe grouped by date is far too slow.
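To make the difference concrete, here is a minimal sketch on a hypothetical six-row frame (the data and the name `small` are made up for illustration and are not the original poster's data). It contrasts a plain ffill with a fill forward restricted to each Date:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame described above: 8-hour steps over two dates.
times = pd.date_range("2001-10-06 05:00", periods=6, freq=pd.Timedelta(hours=8))
index = pd.MultiIndex.from_arrays([times.date, times], names=["Date", "InputTime"])
small = pd.DataFrame({"Value": [np.nan, 2.0, np.nan, np.nan, 8.0, np.nan]}, index=index)

# Plain ffill carries the last value of 2001-10-06 into the first row of 2001-10-07.
plain = small["Value"].ffill()

# Grouping on the first index level keeps the fill inside each Date.
by_date = small["Value"].groupby(level=0).ffill()

print(pd.DataFrame({"raw": small["Value"], "plain": plain, "by_date": by_date}))
```

The fourth row (the first entry of the second date) is where the two results differ: the plain fill reuses the previous day's value, while the per-date fill leaves it as NaN.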

Here is the code I use to compare a simple fill forward (which does not give the expected result but runs very quickly) with a fill forward by date (which gives the expected result but is far too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(0,100000):
    listdata.append((d.date(), d, 2*i if i%3==1 else np.nan, i if i%3==1 else np.nan))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple fill forward over the whole index (fast, but carries values across dates)
simple = df.copy()
start = dt.datetime.now()
for col in df.columns:
    simple[col] = df[col].ffill()
end = dt.datetime.now()
print("Time to fill forward on index = " + str((end-start).total_seconds()) + " s")

# Fill forward within each Date (first level of index); df is left untouched
# so that this step is timed on the original data, not on already-filled data
by_date = df.copy()
start = dt.datetime.now()
for col in df.columns:
    by_date[col] = df[col].groupby(level=0).ffill()
end = dt.datetime.now()
print("Time to fill forward on Date only = " + str((end-start).total_seconds()) + " s")
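As a rough alternative for measuring the same comparison, the sketch below builds a frame with the same layout without a Python loop and times both calls with `time.perf_counter`. The helper `best_time`, the name `bench`, and the reduced size `n = 3000` are assumptions made for illustration; timings will vary with the machine and the pandas version.

```python
import time
import numpy as np
import pandas as pd

def best_time(fn, repeat=3):
    """Return the fastest wall-clock time (seconds) over `repeat` calls to fn."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Same layout as the benchmark above, built with vectorized operations (n kept small here).
n = 3000
times = pd.date_range("2001-10-06 05:00", periods=n, freq=pd.Timedelta(hours=8))
i = np.arange(n)
bench = pd.DataFrame(
    {
        "Value": np.where(i % 3 == 1, 2.0 * i, np.nan),
        "Id": np.where(i % 3 == 1, i.astype(float), np.nan),
    },
    index=pd.MultiIndex.from_arrays([times.date, times], names=["Date", "InputTime"]),
)

t_plain = best_time(lambda: bench.ffill())
t_by_date = best_time(lambda: bench.groupby(level=0).ffill())
n_dates = bench.index.get_level_values("Date").nunique()

print("plain ffill:    %.4f s" % t_plain)
print("per-date ffill: %.4f s" % t_by_date)
print("distinct dates:", n_dates)
```

Printing the number of distinct dates alongside the timings helps show how many groups the per-date fill has to handle.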

Could somebody explain why this code is so slow, or help me find an efficient way to fill forward by date on big dataframes?

Thanks

Recommended answer

github/jreback: this is a duplicate of #7895. .ffill is not implemented in Cython for groupby operations (though it certainly could be); instead, it calls back into Python for each group. Here's an easy way to do this. URL: https://github.com/pandas-dev/pandas/issues/11296

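df = df.sort_index()
# Global ffill, multiplied by a mask that stays null until the first non-null value within each Date
result = df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)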
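The same idea can be sketched with a boolean mask and `DataFrame.where` instead of multiplying by a mask frame. This is a variant written for illustration, not the code from the answer; the function name `ffill_within_first_level` and the twelve-row test frame are assumptions. It relies on the index being sorted so that each Date forms one contiguous block.

```python
import datetime as dt
import numpy as np
import pandas as pd

def ffill_within_first_level(df):
    """Forward fill every column without crossing first-index-level boundaries."""
    df = df.sort_index()
    # Running count of non-null values seen so far within each first-level group.
    seen = df.notna().astype(int).groupby(level=0).cumsum()
    # One fast global ffill, kept only where the same group already had a value.
    return df.ffill().where(seen > 0)

# Small frame with the same layout as the question (hypothetical size of 12 rows).
rows = []
d = dt.datetime(2001, 10, 6, 5)
for i in range(12):
    rows.append((d.date(), d, 2 * i if i % 3 == 1 else np.nan, i if i % 3 == 1 else np.nan))
    d = d + dt.timedelta(hours=8)
small = pd.DataFrame.from_records(
    rows, index=["Date", "InputTime"], columns=["Date", "InputTime", "Value", "Id"]
)

filled = ffill_within_first_level(small)
reference = small.groupby(level=0).ffill()

print(filled)
print("matches groupby ffill:", filled.equals(reference))
```

The mask works because, once a Date has produced at least one non-null value, the most recent non-null value seen by the global ffill necessarily belongs to that same Date. It is still worth timing `groupby(level=0).ffill()` directly on your own pandas version, since the performance limitation described in the answer may not apply to newer releases.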
