Filling NaN by 'ffill' and 'interpolate' depending on time of the day of NaN occurrence in Python

Question

I want to fill NaN in a df using 'ffill' and 'interpolate', depending on the time of day at which the NaN occurs. As you can see below, the first NaN occurs at 6 am and the second NaN at 8 am.

02/03/2016 05:00    8
02/03/2016 06:00    NaN
02/03/2016 07:00    1
02/03/2016 08:00    NaN
02/03/2016 09:00    3

My df consists of thousands of days. I want to apply 'ffill' to any NaN that occurs before 7 am and 'interpolate' to any NaN that occurs after 7 am. My data runs from 6 am to 6 pm.

My attempt is:

df_imputed = (df.between_time("00:00:00", "07:00:00", include_start=True, include_end=False)).ffill()
df_imputed = (df.between_time("07:00:00", "18:00:00", include_start=True, include_end=True)).interpolate()   

Edit: my df contains around 400 columns, so the procedure must apply to all columns.

Recommended answer

Original question: single series of values

You can define a Boolean series and then interpolate or ffill according to your condition via numpy.where:

# setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3]})
df['date'] = pd.to_datetime(df['date'])

# construct Boolean switch series
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')

# use numpy.where to differentiate between two scenarios
df['value'] = np.where(switch, df['value'].interpolate(), df['value'].ffill())

print(df)

                 date  value
0 2016-02-03 05:00:00    8.0
1 2016-02-03 06:00:00    8.0
2 2016-02-03 07:00:00    1.0
3 2016-02-03 08:00:00    2.0
4 2016-02-03 09:00:00    3.0
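
Note that the switch above uses a strict `>` comparison, so a NaN at exactly 07:00 is forward-filled. If 07:00 should be interpolated instead (which is how the between_time attempt in the question splits the day), one possible variant is sketched below. The sample data is made up for illustration and is not from the question, and comparing on `dt.hour` assumes the cutoff falls on a whole hour.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): the NaN sits exactly at 07:00
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-02-03 05:00', '2016-02-03 06:00',
                            '2016-02-03 07:00', '2016-02-03 08:00']),
    'value': [8, 2, np.nan, 4],
})

# True from 07:00 onwards (inclusive); assumes a whole-hour cutoff
switch = df['date'].dt.hour >= 7

# interpolate where switch is True, forward-fill elsewhere
df['value'] = np.where(switch, df['value'].interpolate(), df['value'].ffill())

print(df)
```

With the inclusive comparison the 07:00 value is interpolated between its neighbours (2 and 4) rather than copied from 06:00.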






Updated question: multiple series of values

With multiple value columns, you can adjust the above solution using pd.DataFrame.where and iloc. Or, instead of iloc, you can use loc or other means (e.g. filter) of selecting columns:

# setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]})
df['date'] = pd.to_datetime(df['date'])

# construct Boolean switch series
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')

# use pd.DataFrame.where to differentiate between two scenarios
df.iloc[:, 1:] = df.iloc[:, 1:].interpolate().where(switch, df.iloc[:, 1:].ffill())

print(df)

                 date  value  value2
0 2016-02-03 05:00:00    8.0     3.0
1 2016-02-03 06:00:00    8.0     3.0
2 2016-02-03 07:00:00    1.0     2.0
3 2016-02-03 08:00:00    2.0     4.0
4 2016-02-03 09:00:00    3.0     6.0
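
The question's use of between_time suggests the timestamps may live in a DatetimeIndex rather than in a 'date' column. Under that assumption, a minimal sketch of the same idea applied to every column at once could look like the following. The frame layout and the name `df_imputed` are illustrative, and the switch is wrapped in a Series sharing the frame's index so that `where` can align it row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: timestamps in the index, every column is a value series
idx = pd.to_datetime(['2016-02-03 05:00', '2016-02-03 06:00', '2016-02-03 07:00',
                      '2016-02-03 08:00', '2016-02-03 09:00'])
df = pd.DataFrame({'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]}, index=idx)

# Boolean Series on the same index: True strictly after 07:00, as in the answer above
minutes = df.index.hour * 60 + df.index.minute
switch = pd.Series(minutes > 7 * 60, index=df.index)

# keep interpolated values where switch is True, forward-filled values elsewhere
df_imputed = df.interpolate().where(switch, df.ffill())

print(df_imputed)
```

Because no column slicing is needed here, the same line should carry over to a frame with several hundred value columns, provided they are all numeric.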
