Pandas: how to merge two dataframes on a column by keeping the information of the first one?
Problem description
I have two dataframes, df1 and df2. df1 contains people's ages, while df2 contains their sex. Not every person appears in both df1 and df2.
df1
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30
df2
       Name Sex
0       Tom   M
1      Paul   M
2       Eva   F
3      Jack   M
4  Michelle   F
I want df1 to include each person's sex, with NaN where df2 has no information for that person. I tried df1 = pd.merge(df1, df2, on = 'Name', how = 'outer'), but that also keeps rows for people who appear only in df2, which I don't want. The result I am after is:
df1
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN
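To make the problem concrete, here is a minimal sketch of why the outer join keeps unwanted rows. It rebuilds the two frames from the tables above and uses the indicator parameter of pd.merge to label where each row came from. The variable names outer and right_only are illustrative and not part of the original question.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)

# Rows labelled 'right_only' exist only in df2; these are the unwanted ones
right_only = outer.loc[outer['_merge'] == 'right_only', 'Name']
print(sorted(right_only))
print(len(df1), len(outer))
```

An outer join keeps the union of the keys from both frames, so the result has more rows than df1.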
Sample:
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
#print(df1)
#copy of df1 kept for the merge example
df3 = df1.copy()
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print(df2)
Use map with a Series created by set_index:
df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print(df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN
An alternative solution is merge with a left join:
df = df3.merge(df2[['Name', 'Sex']], on='Name', how='left')
print(df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN
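The two approaches above should agree on this data. A small sketch that checks this, assuming the same sample frames; the names mapped, merged and same are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# Approach 1: look each Name up in a Series indexed by Name
mapped = df1.assign(Sex=df1['Name'].map(df2.set_index('Name')['Sex']))

# Approach 2: left join, which keeps every row of df1 in its original order
merged = df1.merge(df2[['Name', 'Sex']], on='Name', how='left')

# Compare the Sex columns after replacing NaN with a placeholder,
# because NaN never compares equal to itself
same = (mapped['Sex'].fillna('missing').tolist()
        == merged['Sex'].fillna('missing').tolist())
print(same)
```

map only adds a single column, while merge can bring in several columns at once, which may guide the choice between them.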
If you need to map by multiple columns (e.g. Year and Code), use merge with a left join:
df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Year': [2000, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 4, 4],
                    'Age': [34, 18, 44, 27, 30]})
print(df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year': [2001, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 5, 3],
                    'Val': [21, 34, 23, 44, 67]})
print(df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67
#merge with all columns of df2; the shared Name column gets the suffixes _x and _y
df = df1.merge(df2, on=['Year', 'Code'], how='left')
print(df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN
#select columns: the join columns (Year, Code) are always needed, plus the columns to append (Val)
df = df1.merge(df2[['Year', 'Code', 'Val']], on=['Year', 'Code'], how='left')
print(df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN
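If the row count of df1 must be preserved, one option is the validate parameter of merge, which raises an error when the join keys of the right frame are not unique. The sketch below is an illustration and not part of the original answer; df2_dup is a hypothetical lookup table with one repeated (Year, Code) pair, made up for the demonstration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Year': [2000, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 4, 4],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Year': [2001, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 5, 3],
                    'Val': [21, 34, 23, 44, 67]})

# validate='many_to_one' checks that each (Year, Code) pair is unique in df2
df = df1.merge(df2, on=['Year', 'Code'], how='left', validate='many_to_one')
print(len(df) == len(df1))

# Hypothetical table with the pair (2003, 2) repeated, for illustration only
extra_row = pd.DataFrame({'Year': [2003], 'Code': [2], 'Val': [99]})
df2_dup = pd.concat([df2, extra_row], ignore_index=True)

try:
    df1.merge(df2_dup, on=['Year', 'Code'], how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    # The repeated key in the right frame violates many_to_one
    raised = True
print(raised)
```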
If map raises an error, the join column (here Name) contains duplicates:
df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
print(df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30
df3, df4 = df1.copy(), df1.copy()
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})
print(df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5
s = df2.set_index('Name')['Val']
#raises, because the index of s contains Tom twice
df1['New'] = df1['Name'].map(s)
print(df1)
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
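Before mapping, the lookup column can be checked for duplicates. A minimal sketch, assuming the df2 defined above; the name dupes is only illustrative:

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# is_unique is False as soon as any Name occurs more than once
print(df2['Name'].is_unique)

# keep=False marks every occurrence of a duplicated Name, not only the repeats
dupes = df2[df2['Name'].duplicated(keep=False)]
print(dupes)
```

Looking at the duplicated rows first helps decide whether the first value, the last value, or some other rule should win.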
The solutions are to remove the duplicates with DataFrame.drop_duplicates, or to map by a dict, which keeps the value of the last duplicate:
#by default the first value is kept
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print(s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64
df1['New'] = df1['Name'].map(s)
print(df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#pass keep='last' to keep the last value instead
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print(s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64
df3['New'] = df3['Name'].map(s)
print(df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#map by dictionary; for a repeated key the later value overwrites the earlier one
d = dict(zip(df2['Name'], df2['Val']))
print(d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
df4['New'] = df4['Name'].map(d)
print(df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
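For comparison, merge with a left join does not raise on duplicated keys. It repeats the matching row of the left frame once per match instead. A minimal sketch, assuming the same frames with the duplicated Tom; the name merged is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# Tom matches two rows in df2, so the result has one more row than df1
merged = df1.merge(df2, on='Name', how='left')
print(len(df1), len(merged))
print(merged[merged['Name'] == 'Tom'])
```

If the extra rows are not wanted, dropping the duplicates from df2 before merging, as in the drop_duplicates examples above, keeps one row per name.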