Comparing two timeseries dataframes based on some conditions in pandas
Problem description
I have two timeseries dataframes, df1 and df2:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date_1': ['10/11/2017 0:00', '10/11/2017 03:00', '10/11/2017 06:00', '10/11/2017 09:00'],
                    'value_1': [5000, 1500, np.nan, 2000]})
df1['date_1'] = pd.to_datetime(df1.date_1.astype(str), format='%m/%d/%Y %H:%M', errors='coerce')
df1.index = pd.DatetimeIndex(df1.date_1)
df1.drop('date_1', axis=1, inplace=True)
and
df2 = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00',
'2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00',
'2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00',
'2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00',
'2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00',
'2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],
'value_2':[1500.0, 2050.0, np.nan, 2400.0,
2500.0, 2550.0, 2900.0, np.nan,
3200.0, 3500.0, np.nan, 3600.0,
2600.0, 2500.0, 2350.0, 2200.0,
np.nan, 2100.0, np.nan, 2400.0,
2600.0, np.nan, 8000.0, 9000.0]
})
df2['date_2'] = pd.to_datetime(df2.date_2.astype(str), format='%Y-%m-%d %H:%M:%S', errors='coerce')
df2.index = pd.DatetimeIndex(df2.date_2)
df2.drop('date_2', axis = 1, inplace = True)
Both dataframes are observations on the same day but with different time resolutions. df1 has a time resolution of 3 hours, whereas df2 has a time resolution of 30 minutes or less.
I would like to create a new dataframe dfx by comparing the two dataframes under certain conditions, with two columns, count and duration, in dfx.
- First, look at df2['value_2'].
- Compare df2['value_2'] with df1['value_1'].
- If df2['value_2'] < 2800 for a timestamp and df1['value_1'] > 1600 for the df1 timestamp within the nearest half of the resolution of df1 (i.e. within 01:30), count the event as 1, otherwise 0.
- E.g. for df2 timestamps 00:00:00 - 01:30:00, compare df2['value_2'] values with df1['value_1'] at 00:00:00.
- For df2 timestamps 01:31:00 - 03:00:00, compare df2['value_2'] values with df1['value_1'] at 03:00:00.
- For df2 timestamps 03:00:00 - 04:30:00, compare df2['value_2'] values with df1['value_1'] at 03:00:00.
- For df2 timestamps 04:31:00 - 06:00:00, compare df2['value_2'] values with df1['value_1'] at 06:00:00.
And so on, where:
- If df2['value_2'] is NaN for a timestamp t, replace the NaN with the average of the values at timestamps t-1 and t+1, and then make the comparison.
- If df1['value_1'] is NaN for a timestamp t, give the corresponding count the value 0.
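One possible reading of the matching rule above is "compare each df2 row with the df1 row whose timestamp is nearest". The sketch below illustrates that reading on toy data; the names coarse, fine, pos and matched are made up for this example, and it deliberately avoids timestamps that lie exactly halfway between two df1 timestamps.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only, not the frames from the question).
coarse = pd.Series(
    [5000.0, 1500.0, np.nan],
    index=pd.to_datetime(['2017-10-11 00:00', '2017-10-11 03:00', '2017-10-11 06:00']),
)
fine = pd.Series(
    [1500.0, 2400.0, 2500.0, 2600.0],
    index=pd.to_datetime(['2017-10-11 00:30', '2017-10-11 01:20',
                          '2017-10-11 01:40', '2017-10-11 05:20']),
)

# Position of the nearest coarse timestamp for every fine timestamp.
pos = coarse.index.get_indexer(fine.index, method='nearest')
matched = coarse.to_numpy()[pos]

# NaN > 1600 evaluates to False, so a missing coarse value yields count 0.
count = ((fine.to_numpy() < 2800) & (matched > 1600)).astype(int)
print(count)
```

Here the first two rows are matched with 00:00:00 (value 5000), the third with 03:00:00 (value 1500) and the last with 06:00:00 (NaN), so only the first two are counted.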
For the duration column in dfx:
dfx['duration'] = df2.index[i+1] - df2.index[i]
For a count on marginal timestamps like 01:20:00,
dfx['duration'] = (df1.index[i] + 01:30) - df2.index[i]
where df1.index[i] is the timestamp of df1 against which df2 is compared. Rows with a count of 0 get a duration of 00:00.
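The two duration formulas can be folded into one expression if the gap to the next df2 timestamp is capped at the end of the 01:30 window. This is only a sketch of that interpretation; event_duration is a hypothetical helper, not something from pandas or from the question.

```python
import pandas as pd

HALF = pd.Timedelta(minutes=90)  # half of df1's 3-hour resolution

def event_duration(current, following, matched):
    """Gap from `current` to the next observation, capped at the end
    of the window around the matched df1 timestamp (hypothetical helper)."""
    window_end = matched + HALF
    return min(following, window_end) - current

t0 = pd.Timestamp('2017-10-11 00:00')

# Regular case: the next observation is still inside the window.
regular = event_duration(pd.Timestamp('2017-10-11 00:50'),
                         pd.Timestamp('2017-10-11 01:20'), t0)

# Marginal case: the window ends at 01:30, before the next observation.
marginal = event_duration(pd.Timestamp('2017-10-11 01:20'),
                          pd.Timestamp('2017-10-11 01:40'), t0)

print(regular, marginal)
```

With the timestamps from the question this gives 30 minutes for 00:50:00 and 10 minutes for 01:20:00, in line with the desired output.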
Desired output
dfx = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00',
'2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00',
'2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00',
'2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00',
'2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00',
'2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],
'count':[1, 1, 1, 1,
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 1, 1,
1, 0, 0, 0],
'duration':['00:30','00:20','00:30','00:10',
'00:00', '00:00', '00:00', '00:00',
'00:00', '00:00', '00:00', '00:00',
'00:00', '00:00', '00:00', '00:00',
'00:00', '00:00', '00:30', '00:30',
'00:10', '00:00', '00:00', '00:00']})
dfx['date_2'] = pd.to_datetime(dfx.date_2.astype(str), format='%Y-%m-%d %H:%M:%S', errors='coerce')
dfx.index = pd.DatetimeIndex(dfx.date_2)
dfx.drop('date_2', axis = 1, inplace = True)
My question has become quite long in spite of my desire to shorten it. Please, bear with it. I would highly appreciate your kind help.
Thanks!
Input data:
>>> df1
                     value_1
date_1
2017-10-11 00:00:00   5000.0
2017-10-11 03:00:00   1500.0
2017-10-11 06:00:00      NaN
2017-10-11 09:00:00   2000.0
>>> df2
                     value_2
date_2
2017-10-11 00:00:00   1500.0
2017-10-11 00:30:00   2050.0
2017-10-11 00:50:00      NaN
2017-10-11 01:20:00   2400.0
2017-10-11 01:40:00   2500.0
...
2017-10-11 08:20:00   2400.0
2017-10-11 08:50:00   2600.0
2017-10-11 09:20:00      NaN
2017-10-11 09:50:00   8000.0
2017-10-11 10:20:00   9000.0
- Fill NaN values in df2 by linear interpolation between t-1 and t+1:
df2['value_2'] = df2['value_2'].interpolate()
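As a small check on this step: with its default method='linear', interpolate() treats rows as equally spaced, so a single NaN becomes the plain average of its neighbours, which is what the question asks for. The sketch below uses three rows taken from df2 and, for contrast, method='time', which weights by the actual time gaps. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [2050.0, np.nan, 2400.0],
    index=pd.to_datetime(['2017-10-11 00:30', '2017-10-11 00:50', '2017-10-11 01:20']),
)

by_position = s.interpolate()           # default 'linear': ignores the index
by_time = s.interpolate(method='time')  # weights by the 20 min / 30 min gaps

print(by_position.iloc[1])  # 2225.0, the mean of 2050 and 2400
print(by_time.iloc[1])      # 2190.0, closer to the 00:30 value
```

Since the question defines the replacement as the average of t-1 and t+1, the default positional behaviour appears to be the appropriate choice here.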
- Create an interval from df1 according to your rules:
ii = pd.IntervalIndex.from_tuples(
list(zip(df1.index - pd.DateOffset(hours=1, minutes=29),
df1.index + pd.DateOffset(hours=1, minutes=30)))
)
- Bin values into discrete intervals:
df1['interval'] = pd.cut(df1.index, bins=ii)
df2['interval'] = pd.cut(df2.index, bins=ii)
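To see what the binning does at the edges, here is a minimal sketch with two intervals built the same way (1 h 29 min before, 1 h 30 min after each centre). It uses pd.Timedelta and IntervalIndex.from_arrays instead of the DateOffset and from_tuples calls above, purely for brevity. The variable names are illustrative.

```python
import pandas as pd

centers = pd.to_datetime(['2017-10-11 00:00', '2017-10-11 03:00'])
bins = pd.IntervalIndex.from_arrays(
    centers - pd.Timedelta(minutes=89),
    centers + pd.Timedelta(minutes=90),
)  # closed='right' by default: (left, right]

stamps = pd.to_datetime(['2017-10-11 01:30:00',
                         '2017-10-11 01:31:00',
                         '2017-10-11 01:40:00'])

# Category codes: position of the matching interval, -1 if none matches.
codes = pd.cut(stamps, bins=bins).codes
print(list(codes))
```

Because the intervals are closed on the right only, 01:30:00 lands in the first interval, 01:40:00 in the second, and 01:31:00 itself falls in neither. None of the df2 timestamps in the question hit that gap, but data with a timestamp there would get NaN for interval.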
- Merge the two dataframes on interval:
dfx = pd.merge(df2, df1, on='interval', how='left').set_index('interval')
dfx = (dfx['value_2'].lt(2800) & dfx['value_1'].gt(1600)) \
.astype(int).to_frame('count').set_index(df2.index)
- Append the index of df1 resampled at a frequency of 90 minutes:
dti = df2.index.append(
    pd.DatetimeIndex(df1.index.to_series().resample('90min').groups.keys())
).sort_values().drop_duplicates()
dfx = dfx.reindex(dti).ffill().astype(int)
- Compute duration from count and reindex from df2:
dfx['duration'] = dfx.index.to_series().diff(-1).abs() \
.fillna(pd.Timedelta(0)).dt.components \
.apply(lambda x: f"{x['hours']:02}:{x['minutes']:02}",
axis='columns')
dfx.loc[dfx['count'] == 0, 'duration'] = '00:00'
dfx = dfx.reindex(df2.index)
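The last two steps can be followed on a three-row toy example: a window boundary (01:30) is inserted between the observations, it inherits the count of the row before it, and the gap to the next row then gives the duration. This is a simplified sketch that keeps durations as Timedelta values instead of formatting them as strings. The names are illustrative.

```python
import pandas as pd

obs = pd.to_datetime(['2017-10-11 00:50', '2017-10-11 01:20', '2017-10-11 01:40'])
count = pd.Series([1, 1, 0], index=obs)

# End of the first 90-minute window, added as an extra row.
boundary = pd.DatetimeIndex(['2017-10-11 01:30'])
full = obs.append(boundary).sort_values()

# The boundary row inherits the count of the observation before it.
filled = count.reindex(full).ffill().astype(int)

# Gap to the next row; the last row has no successor, so it gets 0.
gap = filled.index.to_series().diff(-1).abs().fillna(pd.Timedelta(0))

# Zero out rows that are not events, then drop the boundary row again.
duration = gap.where(filled == 1, pd.Timedelta(0)).reindex(obs)
print(duration)
```

The 01:20:00 row ends up with 10 minutes rather than 20 because the inserted 01:30:00 row is now its successor, which is how the marginal-timestamp rule is realised.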
Output result:
>>> dfx
count duration
date_2
2017-10-11 00:00:00 1 00:30
2017-10-11 00:30:00 1 00:20
2017-10-11 00:50:00 1 00:30
2017-10-11 01:20:00 1 00:10
2017-10-11 01:40:00 0 00:00
2017-10-11 02:20:00 0 00:00
2017-10-11 02:50:00 0 00:00
2017-10-11 03:00:00 0 00:00
2017-10-11 03:20:00 0 00:00
2017-10-11 03:50:00 0 00:00
2017-10-11 04:20:00 0 00:00
2017-10-11 04:50:00 0 00:00
2017-10-11 05:20:00 0 00:00
2017-10-11 05:50:00 0 00:00
2017-10-11 06:00:00 0 00:00
2017-10-11 06:20:00 0 00:00
2017-10-11 06:50:00 0 00:00
2017-10-11 07:20:00 0 00:00
2017-10-11 07:50:00 1 00:30
2017-10-11 08:20:00 1 00:30
2017-10-11 08:50:00 1 00:10
2017-10-11 09:20:00 0 00:00
2017-10-11 09:50:00 0 00:00
2017-10-11 10:20:00 0 00:00