Joining two dataframes in pandas using full outer join
Question
I have two dataframes in pandas, as shown below. EmpID is the primary key in both dataframes.
import numpy as np
import pandas as pd
df_first = pd.DataFrame([[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]], columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame([[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']], columns=['EmpID', 'Name', 'Department', 'Location'])
I want to join these two dataframes on EmpID so that:
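- Data missing in one dataframe is filled with the value from the other dataframe when the key matches
- Observations with new keys are appended to the resulting dataframe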
I used the code below to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns, which I don't want, so I merged using only the columns of the second table that are not already in the first.
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first,df_second[ColNames],how='outer',on=['EmpID'])
Now I don't get duplicate columns, but the missing values are not filled in for observations where the key matches.
I'd really appreciate it if someone could help me with this.
Regards, Kailash Negi
Recommended answer
It seems you need combine_first with set_index, so that rows are matched on the index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
   EmpID   Department  Location Name  Salary
0      1           HR     Delhi    A  1000.0
1      2          NaN       NaN    B     NaN
2      3      Finance       NaN    C  3000.0
3      4          NaN       NaN    D  8000.0
4      5  Programming       NaN    E  6000.0
5      8        Admin    Mumbai    B     NaN
6      9          Ops  Banglore    D     NaN
7     10    Analytics    Mumbai    K     NaN
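To make the precedence rule of combine_first explicit, here is a minimal sketch on two small frames. The frames `left` and `right` are hypothetical and are not the data from the question. The calling frame keeps its non-null values, nulls are filled from the other frame, and keys found only in the other frame are appended.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, indexed by the key column.
left = pd.DataFrame({"EmpID": [1, 2], "Name": ["A", np.nan]}).set_index("EmpID")
right = pd.DataFrame({"EmpID": [1, 2, 3], "Name": ["Z", "B", "C"]}).set_index("EmpID")

# EmpID 1: left's "A" wins over right's "Z".
# EmpID 2: left is NaN, so right's "B" is used.
# EmpID 3: exists only in right, so it is appended.
combined = left.combine_first(right).reset_index()
print(combined)
```

If the two frames disagree on a non-null value, the order of the call decides which one survives, so it is worth choosing deliberately which frame calls combine_first.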
To get a specific column order, use reindex:
# concatenate all column names together and remove duplicates
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
              .combine_first(df_second.set_index('EmpID'))
              .reset_index()
              .reindex(columns=ColNames))
print (df)
   EmpID Name   Department  Location  Salary
0      1    A           HR     Delhi  1000.0
1      2    B          NaN       NaN     NaN
2      3    C      Finance       NaN  3000.0
3      4    D          NaN       NaN  8000.0
4      5    E  Programming       NaN  6000.0
5      8    B        Admin    Mumbai     NaN
6      9    D          Ops  Banglore     NaN
7     10    K    Analytics    Mumbai     NaN
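For comparison, the merge-based approach from the question can probably be made to produce the same result by coalescing the overlapping columns after the outer merge. This is an alternative sketch, not the method used in the answer above. The suffix '_second' and the variable name `merged` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan],
     [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

# Outer merge: overlapping non-key columns from df_second get the suffix '_second'.
merged = pd.merge(df_first, df_second, how='outer', on='EmpID', suffixes=('', '_second'))

# Coalesce each overlapping pair: keep df_first's value, fall back to df_second's.
for col in df_first.columns.intersection(df_second.columns).drop('EmpID'):
    merged[col] = merged[col].fillna(merged[col + '_second'])
    merged = merged.drop(columns=col + '_second')

# Sort by key so the row order does not depend on the pandas version.
merged = merged.sort_values('EmpID').reset_index(drop=True)
print(merged)
```

With a single overlapping column this is more verbose than combine_first, but it keeps both versions of each column available until the fillna step, which can help when the two frames need to be compared before one is preferred.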