Comparing two time series dataframes based on some conditions in pandas

Problem description

I have two time series dataframes, df1 and df2:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date_1': ['10/11/2017 0:00', '10/11/2017 03:00', '10/11/2017 06:00', '10/11/2017 09:00'],
                    'value_1': [5000, 1500, np.nan, 2000]})

# Parse the month/day/year strings and use them as the index
df1['date_1'] = pd.to_datetime(df1['date_1'], format='%m/%d/%Y %H:%M', errors='coerce')
df1.index = pd.DatetimeIndex(df1['date_1'])
df1.drop('date_1', axis=1, inplace=True)

and

df2 = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00', '2017-10-11 00:50:00', '2017-10-11 01:20:00',
                               '2017-10-11 01:40:00', '2017-10-11 02:20:00', '2017-10-11 02:50:00', '2017-10-11 03:00:00',
                               '2017-10-11 03:20:00', '2017-10-11 03:50:00', '2017-10-11 04:20:00', '2017-10-11 04:50:00',
                               '2017-10-11 05:20:00', '2017-10-11 05:50:00', '2017-10-11 06:00:00', '2017-10-11 06:20:00',
                               '2017-10-11 06:50:00', '2017-10-11 07:20:00', '2017-10-11 07:50:00', '2017-10-11 08:20:00',
                               '2017-10-11 08:50:00', '2017-10-11 09:20:00', '2017-10-11 09:50:00', '2017-10-11 10:20:00'],

                    'value_2': [1500.0, 2050.0, np.nan, 2400.0,
                                2500.0, 2550.0, 2900.0, np.nan,
                                3200.0, 3500.0, np.nan, 3600.0,
                                2600.0, 2500.0, 2350.0, 2200.0,
                                np.nan, 2100.0, np.nan, 2400.0,
                                2600.0, np.nan, 8000.0, 9000.0]
                    })
# The strings include seconds, so the format needs %S;
# without it, errors='coerce' can silently turn every value into NaT
df2['date_2'] = pd.to_datetime(df2['date_2'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
df2.index = pd.DatetimeIndex(df2['date_2'])
df2.drop('date_2', axis=1, inplace=True)

Both dataframes hold observations from the same day but at different time resolutions: df1 has a resolution of 3 hours, whereas df2 has a resolution of 30 minutes or less. I would like to create a new dataframe dfx by comparing the two dataframes under certain conditions, with two columns in dfx: count and duration.

  • First, look at df2['value_2'].
  • Compare df2['value_2'] with df1['value_1'].
  • If df2['value_2'] < 2800 at a timestamp and df1['value_1'] > 1600 at the df1 timestamp within the nearest half of df1's resolution (i.e. 01:30), count the event as 1, otherwise 0.
  • E.g. for df2 timestamps 00:00:00 - 01:30:00, compare df2['value_2'] with
    df1['value_1'] at 00:00:00.
  • For df2 timestamps 01:31:00 - 03:00:00, compare df2['value_2'] with
    df1['value_1'] at 03:00:00.
  • For df2 timestamps 03:00:00 - 04:30:00, compare df2['value_2'] with
    df1['value_1'] at 03:00:00.
  • For df2 timestamps 04:31:00 - 06:00:00, compare df2['value_2'] with
    df1['value_1'] at 06:00:00, and so on, where:
  • If df2['value_2'] is NaN at a timestamp t, replace the NaN with the average of the values at timestamps t-1 and t+1, and then make the comparison.
  • If df1['value_1'] is NaN at a timestamp t, set the corresponding count to 0.
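The matching rule in the bullets above can be sketched without pandas. This is only one reading of the rules, not code from the question: the helper names `nearest_coarse_timestamp` and `count_event` are hypothetical, and a timestamp exactly 01:30 after a df1 timestamp is assumed to belong to the earlier df1 timestamp, as in the 00:00:00 - 01:30:00 example.

```python
from datetime import datetime, timedelta

RESOLUTION = timedelta(hours=3)  # spacing of the df1 timestamps
HALF = RESOLUTION / 2            # 01:30, half of that spacing


def nearest_coarse_timestamp(ts, first_coarse):
    """Return the df1 timestamp whose +/- 01:30 window contains ts.

    A tie (exactly 01:30 after a df1 timestamp) goes to the earlier one.
    """
    shifted = ts - first_coarse - HALF
    steps = -((-shifted) // RESOLUTION)  # ceiling division on timedeltas
    return first_coarse + steps * RESOLUTION


def count_event(value_2, value_1):
    """1 if value_2 < 2800 and value_1 > 1600, else 0.

    Any comparison with NaN is False, so a NaN value_1 yields 0.
    """
    return int(value_2 < 2800 and value_1 > 1600)


start = datetime(2017, 10, 11, 0, 0)
match_0130 = nearest_coarse_timestamp(datetime(2017, 10, 11, 1, 30), start)
match_0140 = nearest_coarse_timestamp(datetime(2017, 10, 11, 1, 40), start)
match_0450 = nearest_coarse_timestamp(datetime(2017, 10, 11, 4, 50), start)
```

With the sample data, 01:30 maps to 00:00, 01:40 maps to 03:00, and 04:50 maps to 06:00.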

For the duration column in dfx: dfx['duration'] = df2.index[i+1] - df2.index[i]. For counts on marginal timestamps such as 01:20:00, dfx['duration'] = (df1.index[i] + 01:30) - df2.index[i], where df1.index[i] is the df1 timestamp against which the df2 row is compared.
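A minimal sketch of this duration rule, under the assumption that a timestamp is "marginal" when the next df2 timestamp lies beyond the end of the current df1 window, and that rows with count 0 get a zero duration as in the desired output. The helper names `event_duration` and `as_hhmm` are hypothetical.

```python
from datetime import datetime, timedelta

HALF = timedelta(hours=1, minutes=30)  # half of df1's 3-hour resolution


def event_duration(ts, next_ts, coarse_ts, count):
    """Duration of one df2 row, cut off at the end of its df1 window."""
    if count == 0:
        return timedelta(0)
    window_end = coarse_ts + HALF
    return min(next_ts, window_end) - ts


def as_hhmm(delta):
    """Format a non-negative timedelta as 'HH:MM'."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


midnight = datetime(2017, 10, 11, 0, 0)
# Regular row: the next df2 timestamp is still inside the window
regular = as_hhmm(event_duration(datetime(2017, 10, 11, 0, 30),
                                 datetime(2017, 10, 11, 0, 50), midnight, 1))
# Marginal row: 01:20 is cut off at the window end 01:30
marginal = as_hhmm(event_duration(datetime(2017, 10, 11, 1, 20),
                                  datetime(2017, 10, 11, 1, 40), midnight, 1))
```

Here `regular` is '00:20' and `marginal` is '00:10', matching the first block of the desired output.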

Desired output

dfx = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00', '2017-10-11 00:50:00', '2017-10-11 01:20:00',
                               '2017-10-11 01:40:00', '2017-10-11 02:20:00', '2017-10-11 02:50:00', '2017-10-11 03:00:00',
                               '2017-10-11 03:20:00', '2017-10-11 03:50:00', '2017-10-11 04:20:00', '2017-10-11 04:50:00',
                               '2017-10-11 05:20:00', '2017-10-11 05:50:00', '2017-10-11 06:00:00', '2017-10-11 06:20:00',
                               '2017-10-11 06:50:00', '2017-10-11 07:20:00', '2017-10-11 07:50:00', '2017-10-11 08:20:00',
                               '2017-10-11 08:50:00', '2017-10-11 09:20:00', '2017-10-11 09:50:00', '2017-10-11 10:20:00'],

                    'count': [1, 1, 1, 1,
                              0, 0, 0, 0,
                              0, 0, 0, 0,
                              0, 0, 0, 0,
                              0, 0, 1, 1,
                              1, 0, 0, 0],

                    'duration': ['00:30', '00:20', '00:30', '00:10',
                                 '00:00', '00:00', '00:00', '00:00',
                                 '00:00', '00:00', '00:00', '00:00',
                                 '00:00', '00:00', '00:00', '00:00',
                                 '00:00', '00:00', '00:30', '00:30',
                                 '00:10', '00:00', '00:00', '00:00']})

# The strings include seconds, so the format needs %S
dfx['date_2'] = pd.to_datetime(dfx['date_2'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
dfx.index = pd.DatetimeIndex(dfx['date_2'])
dfx.drop('date_2', axis=1, inplace=True)

My question has become quite long in spite of my desire to shorten it. Please bear with me. I would highly appreciate your kind help.

Thanks!

Solution

Input data:

>>> df1
                     value_1
date_1
2017-10-11 00:00:00   5000.0
2017-10-11 03:00:00   1500.0
2017-10-11 06:00:00      NaN
2017-10-11 09:00:00   2000.0

>>> df2
                     value_2
date_2
2017-10-11 00:00:00   1500.0
2017-10-11 00:30:00   2050.0
2017-10-11 00:50:00      NaN
2017-10-11 01:20:00   2400.0
2017-10-11 01:40:00   2500.0
...
2017-10-11 08:20:00   2400.0
2017-10-11 08:50:00   2600.0
2017-10-11 09:20:00      NaN
2017-10-11 09:50:00   8000.0
2017-10-11 10:20:00   9000.0

  1. Fill the NaN values in df2 by linear interpolation between t-1 and t+1:

df2['value_2'] = df2['value_2'].interpolate()
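As a quick check that this matches the "average of t-1 and t+1" rule, here is a small sketch on three made-up rows taken from the start of df2. With the default linear method, a single NaN between two known values becomes their mean. Runs of several consecutive NaNs would instead get evenly spaced values, which is not quite the same thing as a neighbour average, although the sample data has no such run.

```python
import numpy as np
import pandas as pd

# One missing value between two known neighbours
values = pd.Series([2050.0, np.nan, 2400.0])

filled = values.interpolate()  # default method is linear
midpoint = (values.iloc[0] + values.iloc[2]) / 2

print(filled.tolist())  # [2050.0, 2225.0, 2400.0]
```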

  2. Create intervals from df1 according to your rules:

ii = pd.IntervalIndex.from_tuples(
    list(zip(df1.index - pd.DateOffset(hours=1, minutes=29),
             df1.index + pd.DateOffset(hours=1, minutes=30)))
)
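To see which timestamps these intervals cover, the sketch below rebuilds the first two of them with `pd.IntervalIndex.from_arrays` (an equivalent constructor used here for brevity) and looks up three timestamps. The helper `interval_position` is hypothetical. Because the intervals are open on the left and closed on the right, 01:30:00 belongs to the first one, while anything after 01:30:00 up to and including 01:31:00 falls into a gap and matches nothing. The sample data has no timestamp in such a gap, but other data might.

```python
import pandas as pd

coarse = pd.DatetimeIndex(['2017-10-11 00:00', '2017-10-11 03:00'])

# Same bounds as in the step above; closed='right' is the default
ii = pd.IntervalIndex.from_arrays(
    coarse - pd.Timedelta(hours=1, minutes=29),
    coarse + pd.Timedelta(hours=1, minutes=30),
)


def interval_position(ts):
    """Position of the interval containing ts, or -1 if none does."""
    hits = ii.contains(pd.Timestamp(ts))
    return int(hits.argmax()) if hits.any() else -1


positions = {ts: interval_position(ts)
             for ts in ['2017-10-11 01:30:00',
                        '2017-10-11 01:31:00',
                        '2017-10-11 01:40:00']}
```

If sub-minute timestamps are possible, contiguous intervals (for example built from a 90-minute offset on both sides) would avoid the gap.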

  3. Bin the timestamps of both dataframes into those intervals:

df1['interval'] = pd.cut(df1.index, bins=ii)
df2['interval'] = pd.cut(df2.index, bins=ii)

  4. Merge the two dataframes on interval and evaluate the condition:

dfx = pd.merge(df2, df1, on='interval', how='left').set_index('interval')
dfx = (dfx['value_2'].lt(2800) & dfx['value_1'].gt(1600)) \
    .astype(int).to_frame('count').set_index(df2.index)
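The rule "if df1['value_1'] is NaN, count is 0" needs no extra handling here, because a comparison against NaN evaluates to False. A small sketch with made-up rows shows the effect:

```python
import numpy as np
import pandas as pd

value_2 = pd.Series([2050.0, 2250.0, 2500.0])
value_1 = pd.Series([5000.0, np.nan, 1500.0])  # middle row has no df1 value

# NaN > 1600 is False, so the middle row gets count 0
count = (value_2.lt(2800) & value_1.gt(1600)).astype(int)

print(count.tolist())  # [1, 0, 0]
```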

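  5. Append the df1 index resampled at a frequency of 90 minutes, so that the interval boundaries become rows of dfx:

# '90min' is the alias newer pandas versions expect in place of '90T'
dti = df2.index.append(
    pd.DatetimeIndex(df1.index.to_series().resample('90min').groups.keys())
).sort_values().drop_duplicates()
dfx = dfx.reindex(dti).ffill().astype(int)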
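The purpose of the extra rows is easier to see on a tiny example. The sketch below uses two made-up count rows and builds the boundaries with `pd.date_range`, which is an assumption made for brevity rather than the resample call above. The boundary row at 01:30 inherits the count of the row before it, and the gap from 01:20 to the next row shrinks from 20 minutes to 10.

```python
import pandas as pd

counts = pd.Series(
    [1, 0],
    index=pd.to_datetime(['2017-10-11 01:20', '2017-10-11 01:40']),
    name='count',
)

# 90-minute boundaries that fall inside or after the observed range
boundaries = pd.date_range('2017-10-11 01:30', '2017-10-11 03:00', freq='90min')

# Add the boundaries as rows and carry the previous count forward
extended = counts.reindex(counts.index.union(boundaries)).ffill().astype(int)

# Time from each row to the next one (NaT for the last row)
stamps = extended.index.to_series()
gaps = stamps.shift(-1) - stamps
```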

  6. Compute duration as the gap to the next row, zero it where count is 0, and reindex back to df2:

dfx['duration'] = dfx.index.to_series().diff(-1).abs() \
                     .fillna(pd.Timedelta(0)).dt.components \
                     .apply(lambda x: f"{x['hours']:02}:{x['minutes']:02}",
                            axis='columns')

dfx.loc[dfx['count'] == 0, 'duration'] = '00:00'
dfx = dfx.reindex(df2.index)
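The string formatting in this step can be tried on its own. The sketch below applies the same `dt.components` approach to three made-up durations. Note that only the hours and minutes components are used, so a duration of a day or more would lose its days part, which does not matter for the sample data.

```python
import pandas as pd

durations = pd.Series(pd.to_timedelta(['00:30:00', '00:10:00', '00:00:00']))

# One column per component: days, hours, minutes, seconds, ...
parts = durations.dt.components

labels = parts.apply(
    lambda row: f"{row['hours']:02}:{row['minutes']:02}",
    axis='columns',
)

print(labels.tolist())  # ['00:30', '00:10', '00:00']
```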

Output result:

>>> dfx
                     count duration
date_2
2017-10-11 00:00:00      1    00:30
2017-10-11 00:30:00      1    00:20
2017-10-11 00:50:00      1    00:30
2017-10-11 01:20:00      1    00:10
2017-10-11 01:40:00      0    00:00
2017-10-11 02:20:00      0    00:00
2017-10-11 02:50:00      0    00:00
2017-10-11 03:00:00      0    00:00
2017-10-11 03:20:00      0    00:00
2017-10-11 03:50:00      0    00:00
2017-10-11 04:20:00      0    00:00
2017-10-11 04:50:00      0    00:00
2017-10-11 05:20:00      0    00:00
2017-10-11 05:50:00      0    00:00
2017-10-11 06:00:00      0    00:00
2017-10-11 06:20:00      0    00:00
2017-10-11 06:50:00      0    00:00
2017-10-11 07:20:00      0    00:00
2017-10-11 07:50:00      1    00:30
2017-10-11 08:20:00      1    00:30
2017-10-11 08:50:00      1    00:10
2017-10-11 09:20:00      0    00:00
2017-10-11 09:50:00      0    00:00
2017-10-11 10:20:00      0    00:00
