Merging two pandas dataframes with complex conditions
Question
I would like to merge two dataframes. Consider the following two dataframes:
df1:
id_A, ts_A, course, weight
id1, 2017-04-27 01:35:30, cotton, 3.5
id1, 2017-04-27 01:36:05, cotton, 3.5
id1, 2017-04-27 01:36:55, cotton, 3.5
id1, 2017-04-27 01:37:20, cotton, 3.5
id2, 2017-04-27 02:35:35, cotton blue, 5.0
id2, 2017-04-27 02:36:00, cotton blue, 5.0
id2, 2017-04-27 02:36:35, cotton blue, 5.0
id2, 2017-04-27 02:37:20, cotton blue, 5.0
df2:
id_B, ts_B, value
id1, 2017-03-27 01:25:40, 100
id1, 2017-03-27 01:25:50, 200
id1, 2017-03-27 01:25:50, 230
id1, 2017-04-27 01:35:40, 240
id1, 2017-04-27 01:35:50, 200
id1, 2017-04-27 01:36:00, 350
id1, 2017-04-27 01:36:10, 400
id1, 2017-04-27 01:36:20, 500
id1, 2017-04-27 01:36:30, 600
id1, 2017-04-27 01:36:40, 700
id1, 2017-04-27 01:36:50, 800
id1, 2017-04-27 01:37:00, 900
id1, 2017-04-27 01:37:10, 1000
id2, 2017-04-27 02:35:40, 1000
id2, 2017-04-27 02:35:50, 2000
id2, 2017-04-27 02:36:00, 4500
id2, 2017-04-27 02:36:10, 3000
id2, 2017-04-27 02:36:20, 6000
id2, 2017-04-27 02:36:30, 5000
id2, 2017-04-27 02:36:40, 5022
id2, 2017-04-27 02:36:50, 5040
id2, 2017-04-27 02:37:00, 3200
id2, 2017-04-27 02:37:10, 9000
df1 should be merged with df2 such that the following condition holds: taking the time interval between two consecutive rows of df1, I want to attach to the earlier row the average value of all df2 rows whose timestamps fall within that interval. For example,
id_A, ts_A, course, weight
id1, 2017-04-27 01:35:30, cotton, 3.5
should be merged with
id_B, ts_B, value
id1, 2017-04-27 01:35:40, 240
id1, 2017-04-27 01:35:50, 200
id1, 2017-04-27 01:36:00, 350
to obtain
id_A, ts_A, course, weight, avgValue
id1, 2017-04-27 01:35:30, cotton, 3.5, 263.3
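To make the interval rule concrete, here is a minimal sketch that checks the 263.3 figure by hand for the first df1 row. It assumes the interval is half-open, [ts_A, next ts_A), and that the next timestamp is taken within the same id. The column name next_ts_A and the variables start, end, in_interval and avg_value are introduced only for illustration.

```python
import pandas as pd

# First two id1 rows of df1 from the question; they bound the first interval.
df1 = pd.DataFrame({
    "id_A": ["id1", "id1"],
    "ts_A": pd.to_datetime(["2017-04-27 01:35:30", "2017-04-27 01:36:05"]),
})

# The id1 rows of df2 around that interval.
df2 = pd.DataFrame({
    "id_B": ["id1"] * 4,
    "ts_B": pd.to_datetime([
        "2017-04-27 01:35:40", "2017-04-27 01:35:50",
        "2017-04-27 01:36:00", "2017-04-27 01:36:10",
    ]),
    "value": [240, 200, 350, 400],
})

# Interval end = next ts_A within the same id (NaT for the last row of each id).
# Assumption: the interval is half-open, [ts_A, next_ts_A).
df1["next_ts_A"] = df1.groupby("id_A")["ts_A"].shift(-1)

start, end = df1.loc[0, "ts_A"], df1.loc[0, "next_ts_A"]
in_interval = (df2["id_B"] == "id1") & (df2["ts_B"] >= start) & (df2["ts_B"] < end)

# Average of the df2 values that fall inside the first interval.
avg_value = df2.loc[in_interval, "value"].mean()
print(round(avg_value, 1))  # 263.3
```

The row at 01:36:10 falls after the interval end (01:36:05), so only 240, 200 and 350 are averaged.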
I tried to approach the problem from another angle, bringing the missing rows of df2 into df1 with merge_asof, but I do not get the right result:
pd.merge_asof(df2_sorted, df1, left_on='ts_B', right_on='ts_A', left_by='id_B', right_by='id_A', direction='backward')
Recommended answer
I think you need merge_asof, with reset_index used to create a counter column that holds a unique value for each row of df1:
df1 = df1.reset_index(drop=True)
print (df1.index)
RangeIndex(start=0, stop=8, step=1)
df = pd.merge_asof(df2_sorted,
                   df1.reset_index(),
                   left_on='ts_B',
                   right_on='ts_A',
                   left_by='id_B',
                   right_by='id_A')
Then group by the output columns (don't forget the index column) and aggregate with mean:
# Parentheses let the method chain span several lines
df = (df.groupby(['id_A', 'ts_A', 'course', 'weight', 'index'], as_index=False)['value']
        .mean()
        .drop('index', axis=1))
print (df)
id_A ts_A course weight value
0 id1 2017-04-27 01:35:30 cotton 3.5 263.333333
1 id1 2017-04-27 01:36:05 cotton 3.5 600.000000
2 id1 2017-04-27 01:36:55 cotton 3.5 950.000000
3 id2 2017-04-27 02:35:35 cotton blue 5.0 1500.000000
4 id2 2017-04-27 02:36:00 cotton blue 5.0 4625.000000
5 id2 2017-04-27 02:36:35 cotton blue 5.0 5565.500000
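For reference, the two steps can be put together as a self-contained sketch on the id2 subset of the question's data. It assumes the timestamp columns are parsed as datetimes and that both frames are sorted on their time keys, which merge_asof requires. The names df1_sorted, merged and result are illustrative; the question does not show how df2_sorted was built, so a plain sort_values on ts_B is assumed here.

```python
import pandas as pd

# id2 subset of the question's data, to keep the sketch short.
df1 = pd.DataFrame({
    "id_A": ["id2"] * 4,
    "ts_A": pd.to_datetime([
        "2017-04-27 02:35:35", "2017-04-27 02:36:00",
        "2017-04-27 02:36:35", "2017-04-27 02:37:20",
    ]),
    "course": ["cotton blue"] * 4,
    "weight": [5.0] * 4,
})
df2 = pd.DataFrame({
    "id_B": ["id2"] * 10,
    # 02:35:40 through 02:37:10 in 10-second steps, as in the question.
    "ts_B": pd.date_range("2017-04-27 02:35:40", periods=10, freq="10s"),
    "value": [1000, 2000, 4500, 3000, 6000, 5000, 5022, 5040, 3200, 9000],
})

# merge_asof requires both frames to be sorted on their time keys.
df1_sorted = df1.sort_values("ts_A").reset_index(drop=True)
df2_sorted = df2.sort_values("ts_B")  # assumption: how df2_sorted was built

# Each df2 row picks up the most recent df1 row with the same id
# (direction defaults to 'backward'; exact timestamp matches are allowed).
merged = pd.merge_asof(
    df2_sorted,
    df1_sorted.reset_index(),  # adds an 'index' column: one counter per df1 row
    left_on="ts_B", right_on="ts_A",
    left_by="id_B", right_by="id_A",
)

# Average the df2 values per df1 row, then drop the helper counter.
result = (
    merged.groupby(["id_A", "ts_A", "course", "weight", "index"], as_index=False)["value"]
    .mean()
    .drop(columns="index")
)
print(result)
```

The last df1 row (02:37:20) does not appear in the result because no df2 row follows it, which matches the output above.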