Pandas, check if timestamp value exists in resampled 30 min time bin of datetimeindex
I have created a resampled dataframe (DF1) in pandas with a DatetimeIndex. I have a separate dataframe (DF2) with a DatetimeIndex and a time column. If an instance of time from DF2 falls within one of the 30 min bins of the DatetimeIndex in DF1, I want to mark that instance of time in DF2 with the appropriate speed from that 30 min bin in DF1.
DF1
boat_id speed
time
2015-01-13 09:00:00 28.000000 0.000000
2015-01-13 09:30:00 28.000000 0.723503
2015-01-13 10:00:00 28.000000 2.239399
DF2
id boat_id time state
time
2015-01-18 16:09:03 319437 28 2015-01-18 16:09:03 2
2015-01-18 16:18:43 319451 28 2015-01-18 16:18:43 0
2015-03-01 09:39:51 507108 31 2015-03-01 09:39:51 1
2015-03-01 09:40:58 507109 31 2015-03-01 09:40:58 0
Desired Result
id boat_id time state speed
time
2015-01-18 16:09:03 319437 28 2015-01-18 16:09:03 2 nan
2015-01-18 16:18:43 319451 28 2015-01-18 16:18:43 0 nan
2015-03-01 09:39:51 507108 31 2015-03-01 09:39:51 1 2.239399
2015-03-01 09:40:58 507109 31 2015-03-01 09:40:58 0 2.239399
I created this script to try to do this, but I think it is failing because the DatetimeIndex of DF1 is immutable, so my timedelta request doesn't create a start point for the chunk. One thought I had was to copy the DatetimeIndex of DF1 into a new column where the objects are mutable, but I haven't managed it yet, so I am not 100% sure of the logic. I'm happy to tinker, but I've been stalled for a while and was hoping someone else might have a few ideas. Thanks in advance.
for row in DF1.iterrows():
    for dfrow in DF2.iterrows():
        if dfrow[0] > row[0] - dt.timedelta(minutes=30) and dfrow[0] < row[0]:
            df['test'] = row[1]
Iterating over rows is very slow; a vectorized solution is better. I use the merge function twice (see the docs).
Input:
print(df1)
boat_id speed
time
2015-03-01 09:00:00 28 0.000000
2015-03-01 09:30:00 28 0.723503
2015-03-01 10:00:00 28 2.239399
print(df2)
id boat_id time state
time
2015-01-18 16:09:03 319437 28 2015-01-18 16:09:03 2
2015-01-18 16:18:43 319451 28 2015-01-18 16:18:43 0
2015-03-01 09:39:51 507108 31 2015-03-01 09:39:51 1
2015-03-01 09:40:58 507109 31 2015-03-01 09:40:58 0
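To follow the steps below, the two input frames can be rebuilt from the printed values. This is only a sketch of one way to construct them; the original post does not show how df1 and df2 were created.

```python
import pandas as pd

# Resampled frame: one row per 30-minute bin, labelled by a DatetimeIndex named "time".
df1 = pd.DataFrame(
    {"boat_id": [28, 28, 28], "speed": [0.000000, 0.723503, 2.239399]},
    index=pd.DatetimeIndex(
        ["2015-03-01 09:00:00", "2015-03-01 09:30:00", "2015-03-01 10:00:00"],
        name="time",
    ),
)

# Event frame: the timestamps appear both as the index and as a "time" column.
times = pd.to_datetime(
    [
        "2015-01-18 16:09:03",
        "2015-01-18 16:18:43",
        "2015-03-01 09:39:51",
        "2015-03-01 09:40:58",
    ]
)
df2 = pd.DataFrame(
    {
        "id": [319437, 319451, 507108, 507109],
        "boat_id": [28, 28, 31, 31],
        "time": times,
        "state": [2, 0, 1, 0],
    },
    index=pd.DatetimeIndex(times, name="time"),
)

print(df1)
print(df2)
```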
I reset the index of both dataframes and create a helper column i filled with 1.
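# Move df1's DatetimeIndex into a regular "time" column.
df1 = df1.reset_index()
# df2 already has a "time" column, so its index can simply be dropped.
df2 = df2.reset_index(drop=True)
# Constant helper key shared by every row of both frames.
df1['i'] = df2['i'] = 1
print(df1)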
time boat_id speed i
0 2015-03-01 09:00:00 28 0.000000 1
1 2015-03-01 09:30:00 28 0.723503 1
2 2015-03-01 10:00:00 28 2.239399 1
print(df2)
id boat_id time state i
0 319437 28 2015-01-18 16:09:03 2 1
1 319451 28 2015-01-18 16:18:43 0 1
2 507108 31 2015-03-01 09:39:51 1 1
3 507109 31 2015-03-01 09:40:58 0 1
Then I merge both dataframes on the helper column i. Because every row has the same key, this pairs each row of df2 with every row of df1.
df = df2.merge(df1, on='i', how='left')
# Both frames have a "time" column; after the merge, time_x comes from df2 and time_y from df1.
df = df.rename(columns={'time_y':'Bin_time', 'time_x':'time'})
print(df)
id boat_id_x time state i Bin_time \
0 319437 28 2015-01-18 16:09:03 2 1 2015-03-01 09:00:00
1 319437 28 2015-01-18 16:09:03 2 1 2015-03-01 09:30:00
2 319437 28 2015-01-18 16:09:03 2 1 2015-03-01 10:00:00
3 319451 28 2015-01-18 16:18:43 0 1 2015-03-01 09:00:00
4 319451 28 2015-01-18 16:18:43 0 1 2015-03-01 09:30:00
5 319451 28 2015-01-18 16:18:43 0 1 2015-03-01 10:00:00
6 507108 31 2015-03-01 09:39:51 1 1 2015-03-01 09:00:00
7 507108 31 2015-03-01 09:39:51 1 1 2015-03-01 09:30:00
8 507108 31 2015-03-01 09:39:51 1 1 2015-03-01 10:00:00
9 507109 31 2015-03-01 09:40:58 0 1 2015-03-01 09:00:00
10 507109 31 2015-03-01 09:40:58 0 1 2015-03-01 09:30:00
11 507109 31 2015-03-01 09:40:58 0 1 2015-03-01 10:00:00
boat_id_y speed
0 28 0.000000
1 28 0.723503
2 28 2.239399
3 28 0.000000
4 28 0.723503
5 28 2.239399
6 28 0.000000
7 28 0.723503
8 28 2.239399
9 28 0.000000
10 28 0.723503
11 28 2.239399
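The pairing of every event row with every bin row is a cross join. As a side note that is not part of the original answer, pandas 1.2 and later can express this directly with how="cross", so the constant helper column is not needed. The sketch below uses small hypothetical frames named events and bins that stand in for df2 and df1.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (bins) and df2 (events), reduced to the relevant columns.
bins = pd.DataFrame(
    {
        "Bin_time": pd.to_datetime(
            ["2015-03-01 09:00:00", "2015-03-01 09:30:00", "2015-03-01 10:00:00"]
        ),
        "speed": [0.000000, 0.723503, 2.239399],
    }
)
events = pd.DataFrame(
    {
        "id": [507108, 507109],
        "time": pd.to_datetime(["2015-03-01 09:39:51", "2015-03-01 09:40:58"]),
    }
)

# how="cross" (pandas >= 1.2) pairs each row of events with each row of bins,
# which is what merging on a constant helper column achieves.
pairs = events.merge(bins, how="cross")
print(pairs)
```

Either way the intermediate result has len(events) * len(bins) rows, so this approach can become memory-hungry for large inputs.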
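The result is then filtered by bin time, keeping only rows where time falls in the 30 minutes up to and including Bin_time:
import datetime as dt
# Keep only the pairs where time lies within the 30 minutes ending at Bin_time.
df = df[(df.time >= df.Bin_time - dt.timedelta(minutes=30)) & (df.time <= df.Bin_time)]
# Keep just the join key (time) and the value to attach (speed).
df = df.drop(['Bin_time', 'id', 'boat_id_x', 'boat_id_y', 'state', 'i'], axis=1)
print(df)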
time speed
8 2015-03-01 09:39:51 2.239399
11 2015-03-01 09:40:58 2.239399
Finally, df is merged with the dataframe df2 on the column time.
# A left merge keeps every row of df2; rows without a matching bin get NaN for speed.
df = df2.merge(df, on='time', how='left').reset_index(drop=True)
df = df.drop(['i'], axis=1)
print(df)
id boat_id time state speed
0 319437 28 2015-01-18 16:09:03 2 NaN
1 319451 28 2015-01-18 16:18:43 0 NaN
2 507108 31 2015-03-01 09:39:51 1 2.239399
3 507109 31 2015-03-01 09:40:58 0 2.239399
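As an alternative that is not part of the original answer, pandas.merge_asof can attach the nearest following bin within a tolerance without building the full cross product. The sketch below assumes, as the filter above does, that each bin label marks the end of its 30-minute window; the frame names bins and events are hypothetical stand-ins for df1 and df2.

```python
import pandas as pd

# Hypothetical stand-in for df1 with its index already moved into a Bin_time column.
bins = pd.DataFrame(
    {
        "Bin_time": pd.to_datetime(
            ["2015-03-01 09:00:00", "2015-03-01 09:30:00", "2015-03-01 10:00:00"]
        ),
        "speed": [0.000000, 0.723503, 2.239399],
    }
)
# Hypothetical stand-in for df2 after reset_index(drop=True).
events = pd.DataFrame(
    {
        "id": [319437, 319451, 507108, 507109],
        "boat_id": [28, 28, 31, 31],
        "time": pd.to_datetime(
            [
                "2015-01-18 16:09:03",
                "2015-01-18 16:18:43",
                "2015-03-01 09:39:51",
                "2015-03-01 09:40:58",
            ]
        ),
        "state": [2, 0, 1, 0],
    }
)

# merge_asof requires both frames to be sorted by their keys.
# direction="forward" picks the first Bin_time at or after each event time;
# tolerance discards matches that are more than 30 minutes away.
tagged = pd.merge_asof(
    events.sort_values("time"),
    bins.sort_values("Bin_time"),
    left_on="time",
    right_on="Bin_time",
    direction="forward",
    tolerance=pd.Timedelta(minutes=30),
)
tagged = tagged.drop(columns="Bin_time")
print(tagged)
```

One difference to keep in mind: a timestamp that lies exactly on a bin boundary satisfies the filter above for two adjacent bins, whereas merge_asof returns a single match per event.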
A comparison of the vectorized and indexing approaches can be found in a similar answer here.