Interpolate the hourly load of selected months of a year from the same months of the previous year and the next year in python pandas?


Problem description

I have the following three dataframes:

df1:

   date_time           system_load
01-01-2017 00:00:00    208111
01-01-2017 01:00:00    208311
01-01-2017 02:00:00    208311
01-01-2017 03:00:00    208011
  ...............       ...
31-12-2017 20:00:00    208611
31-12-2017 21:00:00    208411
31-12-2017 22:00:00    208111
31-12-2017 23:00:00    208911

The system load values of df1 have no problems.

df2:

   date_time           system_load
01-01-2018 00:00:00    208111
01-01-2018 01:00:00    208311
01-01-2018 02:00:00    208311
01-01-2018 03:00:00    208011
  ...............       ...
31-12-2018 20:00:00    209611
31-12-2018 21:00:00    209411
31-12-2018 22:00:00    209111
31-12-2018 23:00:00    209911

The system load values of df2 are missing from 06-03-2018 20:00:00 through 24-10-2018 22:00:00.

df3:

   date_time           system_load
01-01-2019 00:00:00    309119
01-01-2019 01:00:00    309391
01-01-2019 02:00:00    309811
01-01-2019 03:00:00    309711
  ...............       ...
31-12-2019 20:00:00    309611
31-12-2019 21:00:00    309411
31-12-2019 22:00:00    309111
31-12-2019 23:00:00    309911

The system load values of df3 have no problems.

What I want is to interpolate, in a suitable way, the missing hourly records in df2 using the corresponding hourly records of df1 and df3 (06-03-2017 20:00:00 through 24-10-2017 22:00:00 and 06-03-2019 20:00:00 through 24-10-2019 22:00:00, respectively). Following Pierre D's valuable comment, I have attached my scaled data.

Recommended answer

Here is a very basic strategy that just takes data from the neighboring years to fill the missing values. The offset is chosen to be exactly 52 weeks (364 days) rather than a calendar year, so that each filled hour comes from the same weekday and hour, which preserves any weekly seasonality.

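import pandas as pd

# Assumes df1, df2 and df3 are each indexed by a DatetimeIndex (date_time).
# Get the whole series together, and resample so that missing hours become NaN
# ('h' is the hourly alias; the older 'H' spelling is deprecated since pandas 2.2):
s = pd.concat([df1, df2, df3])['system_load'].resample('h').asfreq()

offset = 52 * 7 * 24  # 52 weeks, 7 days/week, 24 hours/day
# Row-wise mean of the values 52 weeks earlier and 52 weeks later (NaN is skipped):
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
# Keep the observed values and use filler only where s is NaN:
out = s.where(~s.isna(), filler)

# optional: make a new df2 with the filled values
df2mod = out.truncate(
    before='2018',
    after=pd.Timestamp('2019') - pd.Timedelta(1)  # 1 ns before 2019-01-01
).to_frame('system_load')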
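The snippet above relies on each dataframe having a DatetimeIndex, since `resample` and `truncate` need one. The question only shows `date_time` as text in day-first form, so the following is a minimal sketch of one possible preparation step. It assumes `date_time` is a plain string column; the helper name `prepare` and the tiny `raw` frame are hypothetical and only for illustration.

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy indexed by a sorted DatetimeIndex parsed from 'date_time'."""
    out = df.copy()
    # The sample data uses day-first timestamps such as "31-12-2017 23:00:00",
    # so the format is given explicitly to avoid a month/day mix-up.
    out["date_time"] = pd.to_datetime(out["date_time"], format="%d-%m-%Y %H:%M:%S")
    return out.set_index("date_time").sort_index()

# Tiny made-up frame in the same layout as the question's data.
raw = pd.DataFrame({
    "date_time": ["01-01-2018 01:00:00", "01-01-2018 00:00:00", "06-03-2018 20:00:00"],
    "system_load": [208311, 208111, 208500],
})
prepared = prepare(raw)
```

If the three dataframes are prepared this way (for example `df1, df2, df3 = map(prepare, (df1, df2, df3))`), the concatenation and resampling step should work as written.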

Notes:

  • out contains the "filled" series for the whole system_load, using the neighboring years.
  • we use pandas.DataFrame.mean() to build the filler series as the mean of the two neighboring years. By default it skips NaN, so if one of the two years has NaN at a given hour, the mean is simply the other, non-NaN value.
  • this is one of the most basic ways of filling the missing data, and likely won't fool a careful observer. Depending on the intended usage of the reconstructed data, a more elaborate strategy should be considered. Data reconstruction is an active field of research, and there are sophisticated methods in the literature. For example, one could use a GAN to build a resulting series that would be very hard to discriminate from real data.
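To make the notes concrete, here is a small self-contained sketch that runs the same fill on synthetic data. The series values are made up (a simple hourly counter, not the question's data), and the gap is set to the same date range as in the question. It illustrates two points: the row-wise mean skips NaN, and a 52-week shift lands on the same weekday and hour in the neighboring years.

```python
import numpy as np
import pandas as pd

# NaN-aware mean: the second row has only one valid value, which is returned as-is.
pair = pd.DataFrame({"prev": [1.0, np.nan], "next": [3.0, 5.0]})
row_means = pair.mean(axis=1)  # skipna=True by default

# Synthetic hourly series covering 2017-2019 (made-up values).
idx = pd.date_range("2017-01-01", "2019-12-31 23:00", freq="h")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Blank out the same range that is missing in the question's df2.
gap = (idx >= "2018-03-06 20:00") & (idx <= "2018-10-24 22:00")
s[gap] = np.nan

offset = 52 * 7 * 24  # 52 weeks expressed in hours
filler = pd.concat([s.shift(offset), s.shift(-offset)], axis=1).mean(axis=1)
out = s.where(~s.isna(), filler)

# A timestamp inside the gap and its two source timestamps.
t = pd.Timestamp("2018-07-04 12:00")
prev_t = t - pd.Timedelta(hours=offset)
next_t = t + pd.Timedelta(hours=offset)
same_weekday = prev_t.dayofweek == t.dayofweek == next_t.dayofweek
```

Because the synthetic series is linear in time, the average of the values 52 weeks before and after reproduces the blanked-out values exactly. Real load data will not behave this neatly, which is the reason for the caveat in the last note.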
