Replace NaN or missing values with rolling mean or other interpolation


Question

I have a pandas dataframe with monthly data for which I want to compute a 12-month moving average. However, the data for every January is missing (NaN), so I am using

pd.rolling_mean(data["variable"], 12, center=True)

but it just gives me all NaN values.

Is there a simple way to ignore the NaN values? I understand that in practice this would become an 11-month moving average.

The dataframe has other variables which do have January data, so I don't want to just throw out the January columns and do an 11-month moving average.

Recommended answer

There are several ways to approach this, and the best one depends on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.

import numpy as np
import pandas as pd
df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp':  [65, 50, 45, np.nan, 40, 43]}).set_index('month')

You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which makes the January temp the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)

df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3']  = df['temp'].rolling( 3,center=True,min_periods=1).mean()
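As an aside on why the original call returned only NaN, here is a minimal sketch using the same temperatures (the names `strict` and `lenient` are illustrative, not from the answer). With an integer window, `min_periods` defaults to the window size, so a window that contains a NaN or runs past the edge of the data produces NaN.

```python
import numpy as np
import pandas as pd

temp = pd.Series([65, 50, 45, np.nan, 40, 43], dtype="float64")

# Default min_periods == window size: a window needs 3 non-NaN values,
# so only the window (65, 50, 45) produces a number here.
strict = temp.rolling(3, center=True).mean()

# min_periods=1: average whatever non-NaN values fall inside the window.
lenient = temp.rolling(3, center=True, min_periods=1).mean()

print(pd.DataFrame({"temp": temp, "strict": strict, "lenient": lenient}))
```

With a 12-month window and a NaN every January, every full window contains a NaN, which is consistent with the all-NaN result in the question. The rollmean12 and rollmean3 columns above avoid this through min_periods=1.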

Those are improvements, but they still have the problem of overwriting existing values with rolling means. To avoid this, you can combine the rolling mean with the update() method (see the pandas documentation for Series.update).

updated = df['rollmean3'].copy()
updated.update(df['temp'])  # in-place: overwrites with the non-NaN values of temp
df['update'] = updated
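The same result can also be reached without an in-place operation. The sketch below is an alternative and not the answer's own method: it uses `Series.fillna` with another Series, which fills only the missing entries and aligns on the index. The column name `filled` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp':  [65, 50, 45, np.nan, 40, 43]}).set_index('month')
df['rollmean3'] = df['temp'].rolling(3, center=True, min_periods=1).mean()

# Keep every observed temperature; take the rolling mean only where temp is NaN.
df['filled'] = df['temp'].fillna(df['rollmean3'])

print(df)
```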

There are even simpler approaches that leave the existing values alone while filling the missing January temps with the previous month, the next month, or the mean of the previous and next months.

df['ffill']   = df['temp'].ffill()         # previous month 
df['bfill']   = df['temp'].bfill()         # next month
df['interp']  = df['temp'].interpolate()   # mean of prev/next

In this case, interpolate() defaults to simple linear interpolation, but several other interpolation methods are available. See the pandas documentation on interpolate for more information, or this Stack Overflow question: Interpolation on DataFrame in pandas
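As one illustration of a non-default option, the sketch below compares `method='linear'`, which treats rows as equally spaced, with `method='time'`, which weights by the gap between timestamps. The dates and values are invented for the example and are not part of the temperature data above.

```python
import numpy as np
import pandas as pd

# Irregularly spaced observations: the gap after the missing value is twice
# as long as the gap before it.
s = pd.Series([10.0, np.nan, 40.0],
              index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']))

linear = s.interpolate()                # ignores the index, uses the midpoint
by_time = s.interpolate(method='time')  # uses the actual time gaps

print(pd.DataFrame({'linear': linear, 'time': by_time}))
```

For evenly spaced monthly data the two methods agree, so the choice only matters when observations are irregular.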

Here is the sample data with all the results:

       temp  rollmean12  rollmean3  update  ffill  bfill  interp
month                                                           
10     65.0        48.6  57.500000    65.0   65.0   65.0    65.0
11     50.0        48.6  53.333333    50.0   50.0   50.0    50.0
12     45.0        48.6  47.500000    45.0   45.0   45.0    45.0
1       NaN        48.6  42.500000    42.5   45.0   40.0    42.5
2      40.0        48.6  41.500000    40.0   40.0   40.0    40.0
3      43.0        48.6  41.500000    43.0   43.0   43.0    43.0

In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one or the other might be better.
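One case where the two approaches diverge is a gap of more than one month. The following sketch uses made-up data with two consecutive missing values: the 3-month rolling mean fills each gap from its single available neighbour, while linear interpolation draws a straight line across the whole gap.

```python
import numpy as np
import pandas as pd

temp = pd.Series([65, 50, np.nan, np.nan, 40, 43], dtype="float64")

# Rolling-mean fill: each missing value sees only one non-NaN neighbour.
roll3 = temp.rolling(3, center=True, min_periods=1).mean()
rolled = temp.fillna(roll3)

# Linear interpolation: evenly spaced steps between 50 and 40.
interp = temp.interpolate()

print(pd.DataFrame({'temp': temp, 'rolled': rolled, 'interp': interp}))
```

With a gap of three or more months, the middle window of a 3-month rolling mean contains no data at all and stays NaN, whereas interpolate() still fills it.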
