Merging two pandas dataframes with complex conditions


Question

I would like to merge two dataframes. Consider the following two:

df1:

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5
id1, 2017-04-27 01:36:05, cotton,      3.5
id1, 2017-04-27 01:36:55, cotton,      3.5
id1, 2017-04-27 01:37:20, cotton,      3.5
id2, 2017-04-27 02:35:35, cotton blue, 5.0
id2, 2017-04-27 02:36:00, cotton blue, 5.0
id2, 2017-04-27 02:36:35, cotton blue, 5.0
id2, 2017-04-27 02:37:20, cotton blue, 5.0

df2:

id_B,  ts_B,                 value
id1,   2017-03-27 01:25:40,  100
id1,   2017-03-27 01:25:50,  200
id1,   2017-03-27 01:25:50,  230
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350
id1,   2017-04-27 01:36:10,  400
id1,   2017-04-27 01:36:20,  500
id1,   2017-04-27 01:36:30,  600
id1,   2017-04-27 01:36:40,  700
id1,   2017-04-27 01:36:50,  800
id1,   2017-04-27 01:37:00,  900
id1,   2017-04-27 01:37:10, 1000
id2,   2017-04-27 02:35:40,  1000
id2,   2017-04-27 02:35:50,  2000
id2,   2017-04-27 02:36:00,  4500
id2,   2017-04-27 02:36:10,  3000
id2,   2017-04-27 02:36:20,  6000
id2,   2017-04-27 02:36:30,  5000
id2,   2017-04-27 02:36:40,  5022
id2,   2017-04-27 02:36:50,  5040
id2,   2017-04-27 02:37:00,  3200
id2,   2017-04-27 02:37:10,  9000

df1 should be merged with df2 so that the following condition holds: taking the time interval between two consecutive rows of df1 for the same id, each df1 row should be merged with the average value of all the df2 rows that fall within that interval. For example,

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5

should be merged with

id_B,  ts_B,                 value
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350

to give

id_A,           ts_A,    course,     weight  avgValue
id1, 2017-04-27 01:35:30, cotton,      3.5  263.3
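
The expected value for this row can be checked by hand. The sketch below is only an illustration of the condition, not part of the original question: it selects the df2 rows between one ts_A and the next and averages them. The half-open interval (start included, end excluded) and the variable names start, end and in_window are assumptions made for the example.

```python
import pandas as pd

# A few id1 rows of df2 around the first interval of df1
df2 = pd.DataFrame({
    "id_B": ["id1"] * 4,
    "ts_B": pd.to_datetime([
        "2017-04-27 01:35:40",
        "2017-04-27 01:35:50",
        "2017-04-27 01:36:00",
        "2017-04-27 01:36:10",
    ]),
    "value": [240, 200, 350, 400],
})

start = pd.Timestamp("2017-04-27 01:35:30")  # ts_A of the current df1 row
end = pd.Timestamp("2017-04-27 01:36:05")    # ts_A of the next df1 row

# Assumed boundary handling: start <= ts_B < end
in_window = (df2["id_B"] == "id1") & (df2["ts_B"] >= start) & (df2["ts_B"] < end)
avg_value = df2.loc[in_window, "value"].mean()
print(round(avg_value, 1))  # 263.3, i.e. (240 + 200 + 350) / 3
```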

I tried to approach the problem from another perspective, bringing the rows of df2 into df1 with merge_asof, but I do not get the right result:

pd.merge_asof(df2_sorted, df1, left_on='ts_B', right_on='ts_A', left_by='id_B', right_by='id_A', direction='backward')


Recommended answer

I think you need merge_asof, with reset_index used to add a counter column that gives each row of df1 a unique value:

df1 = df1.reset_index(drop=True)
print(df1.index)
# RangeIndex(start=0, stop=8, step=1)

df = pd.merge_asof(df2_sorted, 
                   df1.reset_index(), 
                   left_on='ts_B', 
                   right_on='ts_A', 
                   left_by='id_B', 
                   right_by='id_A')

Then group by the output columns (don't forget the index column) and aggregate with mean:

# Parentheses are needed so the method chain can span several lines
df = (df.groupby(['id_A', 'ts_A', 'course', 'weight', 'index'], as_index=False)['value']
        .mean()
        .drop('index', axis=1))
print(df)
  id_A                ts_A       course  weight        value
0  id1 2017-04-27 01:35:30       cotton     3.5   263.333333
1  id1 2017-04-27 01:36:05       cotton     3.5   600.000000
2  id1 2017-04-27 01:36:55       cotton     3.5   950.000000
3  id2 2017-04-27 02:35:35  cotton blue     5.0  1500.000000
4  id2 2017-04-27 02:36:00  cotton blue     5.0  4625.000000
5  id2 2017-04-27 02:36:35  cotton blue     5.0  5565.500000
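
Note that the output has six rows, not eight: the df1 rows at 01:37:20 (id1) and 02:37:20 (id2) have no later df2 rows, so they never appear in the merge_asof result. The following self-contained sketch rebuilds the example data and shows one possible variation that keeps those rows with a missing average. The names matched, avg, result and avgValue, and the choice to join the averages back onto df1, are assumptions for illustration and not part of the original answer.

```python
import io
import pandas as pd

df1_csv = """id_A,ts_A,course,weight
id1,2017-04-27 01:35:30,cotton,3.5
id1,2017-04-27 01:36:05,cotton,3.5
id1,2017-04-27 01:36:55,cotton,3.5
id1,2017-04-27 01:37:20,cotton,3.5
id2,2017-04-27 02:35:35,cotton blue,5.0
id2,2017-04-27 02:36:00,cotton blue,5.0
id2,2017-04-27 02:36:35,cotton blue,5.0
id2,2017-04-27 02:37:20,cotton blue,5.0
"""

df2_csv = """id_B,ts_B,value
id1,2017-03-27 01:25:40,100
id1,2017-03-27 01:25:50,200
id1,2017-03-27 01:25:50,230
id1,2017-04-27 01:35:40,240
id1,2017-04-27 01:35:50,200
id1,2017-04-27 01:36:00,350
id1,2017-04-27 01:36:10,400
id1,2017-04-27 01:36:20,500
id1,2017-04-27 01:36:30,600
id1,2017-04-27 01:36:40,700
id1,2017-04-27 01:36:50,800
id1,2017-04-27 01:37:00,900
id1,2017-04-27 01:37:10,1000
id2,2017-04-27 02:35:40,1000
id2,2017-04-27 02:35:50,2000
id2,2017-04-27 02:36:00,4500
id2,2017-04-27 02:36:10,3000
id2,2017-04-27 02:36:20,6000
id2,2017-04-27 02:36:30,5000
id2,2017-04-27 02:36:40,5022
id2,2017-04-27 02:36:50,5040
id2,2017-04-27 02:37:00,3200
id2,2017-04-27 02:37:10,9000
"""

df1 = pd.read_csv(io.StringIO(df1_csv), parse_dates=["ts_A"])
df2 = pd.read_csv(io.StringIO(df2_csv), parse_dates=["ts_B"])

# merge_asof requires both frames to be sorted by their time keys
df1 = df1.sort_values("ts_A").reset_index(drop=True)
df2_sorted = df2.sort_values("ts_B")

# For each df2 row, find the latest df1 row of the same id with ts_A <= ts_B
matched = pd.merge_asof(
    df2_sorted,
    df1.reset_index(),  # adds an "index" column identifying each df1 row
    left_on="ts_B", right_on="ts_A",
    left_by="id_B", right_by="id_A",
)

# df2 rows that precede every df1 row of their id have no match; drop them
matched = matched.dropna(subset=["index"])
matched["index"] = matched["index"].astype(int)

# Average per df1 row, then join back so unmatched df1 rows are kept as NaN
avg = matched.groupby("index")["value"].mean().rename("avgValue")
result = df1.join(avg)
print(result)
```

With the default direction='backward' and exact matches allowed, a df2 row whose ts_B equals a ts_A is assigned to that df1 row, so each interval includes its start and excludes its end. The last df1 row of each id has no upper bound, so any later df2 rows for that id would all be averaged into it.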
