Merge pandas DataFrames based on conditions


Problem description

I have two files that show information about transactions on products.

Operations of type 1:

import pandas as pd

d_op_1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3],'cost':[10,20,20,20,10,20,20,20],
                       'date':[2000,2006,2012,2000,2009,2009,2002,2006]})

Operations of type 2:

d_op_2 = pd.DataFrame({'id':[1,1,2,2,3,4,5,5],'cost':[3000,3100,3200,4000,4200,3400,2000,2500],
                       'date':[2010,2015,2008,2010,2006,2010,1990,2000]})

I want to keep only those records where there has been an operation of type 1 between two operations of type 2. For example, for the product with id 1 there was an operation of type 1 (2012) between two operations of type 2 (2010, 2015), so I want to keep that record.

The desired output could be:

Or this:

Using pd.merge() I got this result:

How can I filter this to get the desired output?

Recommended answer

You can use:

#concat DataFrames together, one cost column per operation type
df4 = pd.concat([d_op_1.rename(columns={'cost':'cost1'}), 
                 d_op_2.rename(columns={'cost':'cost2'})]).fillna(0).astype(int)

#print (df4)

#find min and max dates of type 2 operations per group
#(named aggregation; dict-based renaming in agg no longer works in recent pandas)
df3 = d_op_2.groupby('id')['date'].agg(start='min', end='max')
#print (df3)

#join min and max dates to the concatenated df
df = df4.join(df3, on='id')
#keep rows dated strictly between the first and last type 2 operation
df = df[(df.date > df.start) & (df.date < df.end)]
#reshape df to get min, max and the dates between them in one column
#(value_name must not match an existing column name in recent pandas)
df = pd.melt(df, 
             id_vars=['id','cost1'], 
             value_vars=['date','start','end'], 
             value_name='value')
#remove helper columns and restore the 'date' column name
df = df.drop(['cost1','variable'], axis=1) \
       .rename(columns={'value':'date'}) \
       .drop_duplicates()
#merge to original, sorting
df = pd.merge(df, df4, on=['id', 'date']) \
       .sort_values(['id','date']).reset_index(drop=True)
#reorder columns
df = df[['id','cost1','cost2','date']]

print (df)
   id  cost1  cost2  date
0   1      0   3000  2010
1   1     20      0  2012
2   1      0   3100  2015
3   2      0   3200  2008
4   2     10      0  2009
5   2     20      0  2009
6   2      0   4000  2010

#if lists are needed for duplicated rows
df = df.groupby(['id','cost2', 'date'])['cost1'] \
       .apply(lambda x: list(x) if len(x) > 1 else x.values[0]) \
       .reset_index()
df = df[['id','cost1','cost2','date']]
print (df)
   id     cost1  cost2  date
0   1        20      0  2012
1   1         0   3000  2010
2   1         0   3100  2015
3   2  [10, 20]      0  2009
4   2         0   3200  2008
5   2         0   4000  2010
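
If only the type 1 rows that fall between the first and last type 2 operation are needed, the filtering step can also be sketched on its own, without the concat and melt steps. This is a minimal sketch, not part of the original answer; the names `bounds`, `start`, `end` and `kept` are introduced here for illustration, and it assumes the same strict comparison (dates equal to a type 2 date are excluded).

```python
import pandas as pd

# Sample data from the question
d_op_1 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3],
                       'cost': [10, 20, 20, 20, 10, 20, 20, 20],
                       'date': [2000, 2006, 2012, 2000, 2009, 2009, 2002, 2006]})
d_op_2 = pd.DataFrame({'id': [1, 1, 2, 2, 3, 4, 5, 5],
                       'cost': [3000, 3100, 3200, 4000, 4200, 3400, 2000, 2500],
                       'date': [2010, 2015, 2008, 2010, 2006, 2010, 1990, 2000]})

# First and last type 2 date per id (hypothetical helper frame)
bounds = d_op_2.groupby('id')['date'].agg(['min', 'max'])

# Look up the bounds for every type 1 row; ids without type 2 rows give NaN
start = d_op_1['id'].map(bounds['min'])
end = d_op_1['id'].map(bounds['max'])

# Keep type 1 rows dated strictly between the two bounds
kept = d_op_1[(d_op_1['date'] > start) & (d_op_1['date'] < end)]
print(kept)
```

Comparisons against NaN evaluate to False, so ids that have no type 2 operations are dropped automatically. The surrounding type 2 rows are not included in `kept`; the merge-based approach above is still needed if they should appear in the result.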
