Multiple operations on two dataframes using pandas
Problem description
This is an extension of my previous question.
I have two dataframes, df1 and df2, of different lengths, with two columns serving as key columns. I would like to perform the following operations on them:
- Replace the blank (NA) cells in df1 with the corresponding values from df2, matched on each pair of key-column values.
- Report, in a new dataframe, the cells whose values conflict between the two dataframes.
df1
id_col1  id_col2  name   age  sex
---------------------------------
101      1M              21
101      3M              21   M
102      1M       Mark   25
df2
id_col1  id_col2  name   age  sex
---------------------------------
101      1M       Steve       M
101      2M                   M
101      3M       Steve  25
102      1M       Ria    25   M
102      2M       Anie   22   F
After operation 1, i.e. replacing the NAs in df1 with the corresponding values from df2, I should get the following:
id_col1 id_col2 name age sex
-------------------------------------------
101 1M Steve 21 M
101 3M Steve 25 M
102 1M Mark 25 M
After operation 2, i.e. reporting the cells that conflict between df1 and df2 for the same key columns, I should get the following:
id_col1  id_col2  name   age  sex
---------------------------------
101      3M              21
101      3M              25
102      1M       Mark
102      1M       Ria
Can anyone help me solve these?
Recommended answer
The approach I would take is very similar to the answer to the original problem.
Set the id columns as the index and use combine_first, as in the original post. However, since combine_first returns the union of both dataframes' labels (i.e. rows and columns), select only the index entries that belong to df1 after applying it:
import pandas as pd

idx = ['id_col1', 'id_col2']
df1 = df1.set_index(idx)
df2 = df2.set_index(idx)
result_1 = df1.combine_first(df2).loc[df1.index]
# result_1 outputs:
                  name   age sex
id_col1 id_col2
101     1M       Steve  21.0   M
        3M       Steve  21.0   M
102     1M        Mark  25.0   M
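For anyone who wants to reproduce this step end to end, here is a minimal, self-contained sketch. The sample frames are rebuilt from the tables in the question, on the assumption that the blank cells are NaN (not empty strings); the names left and right are only illustrative.

import numpy as np
import pandas as pd

# Sample frames rebuilt from the question's tables; blanks are assumed to be NaN.
df1 = pd.DataFrame({
    "id_col1": [101, 101, 102],
    "id_col2": ["1M", "3M", "1M"],
    "name": [np.nan, np.nan, "Mark"],
    "age": [21, 21, 25],
    "sex": [np.nan, "M", np.nan],
})
df2 = pd.DataFrame({
    "id_col1": [101, 101, 101, 102, 102],
    "id_col2": ["1M", "2M", "3M", "1M", "2M"],
    "name": ["Steve", np.nan, "Steve", "Ria", "Anie"],
    "age": [np.nan, np.nan, 25, 25, 22],
    "sex": ["M", "M", np.nan, "M", "F"],
})

idx = ["id_col1", "id_col2"]
left = df1.set_index(idx)
right = df2.set_index(idx)

# combine_first keeps df1's values and fills only its missing cells from df2;
# .loc[left.index] then drops the keys that exist only in df2.
result_1 = left.combine_first(right).loc[left.index]
print(result_1)

Because combine_first only fills missing cells, the age for key (101, 3M) stays at df1's value of 21 even though df2 has 25 for the same key.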
To generate result_2, first follow the approach suggested in the earlier post:
mask = pd.notnull(df1) & ~df1.eq(df2) & pd.notnull(df2)
result_2 = pd.concat([df1[mask], df2[mask]]).dropna(how='all')
This generates the desired data, but the row order differs slightly from what you showed because df1[mask] is stacked on top of df2[mask].
Sort by index to get the final result:
result_2.sort_index()
# outputs
                  name   age  sex
id_col1 id_col2
101     3M         NaN  21.0  NaN
        3M         NaN  25.0  NaN
102     1M        Mark   NaN  NaN
        1M         Ria   NaN  NaN
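The conflict report can also be sketched as a self-contained script. This is a variation on the code above, not the exact same calls: it assumes the blank cells are NaN, and it first reindexes df2 to df1's keys so that both frames carry identical row labels before they are compared. The names left, right and conflicts are illustrative only.

import numpy as np
import pandas as pd

# Sample frames rebuilt from the question's tables; blanks are assumed to be NaN.
df1 = pd.DataFrame({
    "id_col1": [101, 101, 102],
    "id_col2": ["1M", "3M", "1M"],
    "name": [np.nan, np.nan, "Mark"],
    "age": [21, 21, 25],
    "sex": [np.nan, "M", np.nan],
})
df2 = pd.DataFrame({
    "id_col1": [101, 101, 101, 102, 102],
    "id_col2": ["1M", "2M", "3M", "1M", "2M"],
    "name": ["Steve", np.nan, "Steve", "Ria", "Anie"],
    "age": [np.nan, np.nan, 25, 25, 22],
    "sex": ["M", "M", np.nan, "M", "F"],
})

idx = ["id_col1", "id_col2"]
left = df1.set_index(idx)
# Align df2 to df1's keys so both frames share the same row labels.
right = df2.set_index(idx).reindex(left.index)

# A cell conflicts when both frames hold a value and the two values differ.
mask = left.notna() & right.notna() & left.ne(right)

# where(mask) blanks out the non-conflicting cells; dropna(how="all") removes
# rows without any conflict; sort_index groups the rows by key.
conflicts = (
    pd.concat([left.where(mask), right.where(mask)])
    .dropna(how="all")
    .sort_index()
)
print(conflicts)

Keys that exist in df1 but not in df2 would become all-NaN rows after the reindex, so they can never be flagged as conflicts.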
The only differences between this solution and the one presented in the first post are the extra ...loc[df1.index] and result_2.sort_index().