Pandas: how to merge two dataframes on a column by keeping the information of the first one?

Problem description

I have two dataframes, df1 and df2. df1 contains the age of each person, while df2 contains the sex of each person. Not every person appears in both df1 and df2.

df1
     Name   Age 
0     Tom    34
1     Sara   18
2     Eva    44
3     Jack   27
4     Laura  30

df2
     Name      Sex 
0     Tom       M
1     Paul      M
2     Eva       F
3     Jack      M
4     Michelle  F

I want to add the sex of each person to df1, setting NaN where df2 has no information for that person. I tried df1 = pd.merge(df1, df2, on='Name', how='outer'), but that also keeps rows for people who appear only in df2, which I don't want. The result I am after is:

df1
     Name   Age     Sex
0     Tom    34      M
1     Sara   18     NaN
2     Eva    44      F
3     Jack   27      M
4     Laura  30     NaN

Solution

Sample:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})

#print (df1)
df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)

Use map with a Series created by set_index (Name becomes the index, so each name in df1 looks up its Sex):

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN
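
To make the lookup step explicit, here is a minimal sketch that builds the intermediate Series separately and checks which names had no match. The variable names lookup, mapped and missing_names are introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# Lookup table: the index holds Name, the values hold Sex
lookup = df2.set_index('Name')['Sex']

# Each name in df1 is looked up in that index; names absent from df2 give NaN
mapped = df1['Name'].map(lookup)

# map returns one value per row of df1, so the row count cannot change
same_length = len(mapped) == len(df1)
missing_names = df1.loc[mapped.isna(), 'Name'].tolist()

print(same_length)
print(missing_names)
```

Because map only ever returns one value per row of df1, people who appear only in df2 (Paul, Michelle) can never be added to the result.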

Alternatively, use merge with a left join, which keeps every row of the left frame and only the matching rows of the right one:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN
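
The difference between the outer join from the question and the left join used here can be seen with merge's indicator parameter. The following is a small sketch for illustration; the names outer, left and right_only are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = df1.merge(df2, on='Name', how='outer', indicator=True)
left = df1.merge(df2, on='Name', how='left', indicator=True)

# Rows that exist only in df2 are what the outer join adds on top of df1
right_only = sorted(outer.loc[outer['_merge'] == 'right_only', 'Name'])

print(len(outer), len(left))
print(right_only)
```

The outer join carries along the people found only in df2, while the left join keeps exactly the rows of df1.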


If you need to match on multiple columns (e.g. Year and Code), use merge with a left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Year':[2000,2003,2003,2004,2007],
                    'Code':[1,2,3,4,4],
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year':[2001,2003,2003,2004,2007],
                    'Code':[1,2,3,5,3],
                    'Val':[21,34,23,44,67]})
print (df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67

# merge with all columns of df2; the overlapping Name column gets _x/_y suffixes
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN

# select only some columns of df2: the join keys (Year, Code) are always required, plus the columns to append (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN
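
If all columns of df2 are wanted but the Name_x / Name_y naming is not, merge's suffixes parameter is one option. This is a sketch only; the suffix '_df2' is an arbitrary choice made for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Year': [2000, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 4, 4],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year': [2001, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 5, 3],
                    'Val': [21, 34, 23, 44, 67]})

# Keep df1's column names unchanged and mark only df2's overlapping columns
df = df1.merge(df2, on=['Year', 'Code'], how='left', suffixes=('', '_df2'))

print(df.columns.tolist())
```

Only columns that exist in both frames and are not join keys receive a suffix, so here just Name is affected.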


If map raises an error, the join column (here Name) contains duplicates:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                    'Val': [1,2,3,4,5]})
print (df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5

s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
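
Before mapping, it may help to check whether the key column is unique and to look at the offending rows. A minimal sketch, with has_dupes and dupes as illustrative names:

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# True if any Name occurs more than once
has_dupes = df2['Name'].duplicated().any()

# keep=False marks every occurrence of a duplicated key, not just the repeats
dupes = df2[df2['Name'].duplicated(keep=False)]

print(has_dupes)
print(dupes)
```

Inspecting dupes first makes it easier to decide whether the first value, the last value, or some aggregate should be kept.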

The solution is to remove the duplicates with DataFrame.drop_duplicates, or to map with a dict, which keeps the last match for each duplicated key:

# by default, the first value is kept
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

# pass keep='last' to keep the last value
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)
print (df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

# map with a dictionary (later duplicates overwrite earlier ones)
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)
print (df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
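
Duplicated keys affect the merge approach as well, though differently: a left join does not fail, it repeats the matching row of df1 once per duplicate. The sketch below illustrates this and uses merge's validate parameter to turn the situation into an explicit error; the names merged, raised and deduped are illustrative.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# Tom matches two rows of df2, so he appears twice in the result
merged = df1.merge(df2, on='Name', how='left')
print(len(df1), len(merged))

# validate='many_to_one' requires unique keys in the right frame
try:
    df1.merge(df2, on='Name', how='left', validate='many_to_one')
    raised = False
except MergeError:
    raised = True
print(raised)

# Dropping duplicates first restores one output row per row of df1
deduped = df1.merge(df2.drop_duplicates('Name'), on='Name', how='left')
print(len(deduped))
```

Whether to keep the first value, the last value, or all matches depends on the data; the point here is only that the row count is worth checking after a left join.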
