Fill a dataframe with NaN when multiple days of data are missing
Question
I have a pandas dataframe that I interpolate to get a daily dataframe. The original dataframe looks like this:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
In the interpolated dataframe, I want to change data values to NaN where the gap in dates exceeds 5 days. For example, in the dataframe above, the gap between 2017-10-02 and 2017-10-12 exceeds 5 days, so in the interpolated dataframe all values between these two dates should be removed. I am not sure how to do this; maybe combine_first?
The interpolated dataframe looks like this:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-03 0.015804 0.113309
2017-10-04 0.014464 0.113750
2017-10-05 0.013125 0.114190
2017-10-06 0.011786 0.114631
2017-10-07 0.010446 0.115071
2017-10-08 0.009107 0.115512
2017-10-09 0.007768 0.115953
2017-10-10 0.006429 0.116393
2017-10-11 0.005089 0.116834
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
Expected output:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
Recommended answer
I'd first identify where the gaps exceed 5 days. From there, I'd generate an array that identifies the groups between such gaps. Finally, I'd use groupby to convert each group to daily frequency and interpolate within it. Because each group is interpolated separately, no values are produced for the days inside a long gap.
import numpy as np

# convenience: assign string to variable for easier access
daytype = 'timedelta64[D]'
# define five days for use when evaluating size of gaps
five = np.array(5, dtype=daytype)
# get the size of gaps
deltas = np.diff(df.index.values).astype(daytype)
# identify groups between gaps
groups = np.append(False, deltas > five).cumsum()
# handy function to turn to daily frequency and interpolate
to_daily = lambda x: x.asfreq('D').interpolate()
# and finally...
df.groupby(groups, group_keys=False).apply(to_daily)
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
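For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data is typed in exactly as shown in the question. The gap detection uses pandas' own `diff` in place of `np.diff`, and a list comprehension with `pd.concat` in place of `groupby(...).apply`. These are swapped in for illustration and are not part of the original answer. The names `gaps`, `pieces` and `with_nan` are likewise made up here. The last step reindexes onto the full daily range, which should turn the days inside a long gap into NaN rows rather than leaving them out.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "col_1": [0.000000, 0.017143, 0.003750, 0.000000, 0.000000],
        "vals": [0.112869, 0.112869, 0.117274, 0.161556, 0.116264],
    },
    index=pd.to_datetime(
        ["2017-10-01", "2017-10-02", "2017-10-12", "2017-10-14", "2017-10-17"]
    ),
)

# True wherever the step from the previous date exceeds 5 days
gaps = df.index.to_series().diff() > pd.Timedelta(days=5)

# Running count of long gaps gives one integer label per group
groups = gaps.cumsum().to_numpy()

# Upsample to daily frequency and interpolate each group on its own
pieces = [g.asfreq("D").interpolate() for _, g in df.groupby(groups)]
daily = pd.concat(pieces)

# Optional: reindex onto every day so the long gap shows up as NaN rows
full_index = pd.date_range(df.index.min(), df.index.max(), freq="D")
with_nan = daily.reindex(full_index)

print(daily)
print(with_nan)
```

Use `daily` if the gap days should be absent, as in the expected output, or `with_nan` if they should be present but empty.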
If you want to provide your own interpolation method, you can modify the above like this:
daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(df.index.values).astype(daytype)
groups = np.append(False, deltas > five).cumsum()
# custom interpolation function that takes a dataframe
def my_interpolate(df):
    """This can be whatever you want.
    I just provided what will result
    in the same thing as before."""
    return df.interpolate()
to_daily = lambda x: x.asfreq('D').pipe(my_interpolate)
df.groupby(groups, group_keys=False).apply(to_daily)
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
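To make the hook more concrete, here is a small sketch with a custom function that actually changes the result. `hold_last` is a hypothetical name invented for this example. It carries the last observation forward instead of interpolating linearly. The sample data is assumed to be the frame from the question, and the grouping uses pandas' `diff` rather than `np.diff` purely for brevity.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "col_1": [0.000000, 0.017143, 0.003750, 0.000000, 0.000000],
        "vals": [0.112869, 0.112869, 0.117274, 0.161556, 0.116264],
    },
    index=pd.to_datetime(
        ["2017-10-01", "2017-10-02", "2017-10-12", "2017-10-14", "2017-10-17"]
    ),
)

# Label the groups of dates separated by gaps of more than 5 days
long_gap = df.index.to_series().diff() > pd.Timedelta(days=5)
groups = long_gap.cumsum().to_numpy()


def hold_last(frame):
    """Hypothetical custom filler: repeat the last known row forward."""
    return frame.ffill()


# Same pattern as above: daily frequency first, then the custom function
pieces = [g.asfreq("D").pipe(hold_last) for _, g in df.groupby(groups)]
stepped = pd.concat(pieces)

print(stepped)
```

Any function that takes a dataframe and returns one of the same shape can be dropped in where `hold_last` is used.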