Multiple merge operations on two dataframes using pandas
Question
I have two dataframes on which several operations need to be performed, for example:
old_DF
id   col1   col2   col3
-------------------------
1    aaa
2           bbb    123
new_DF
id   col1   col2   col3
-------------------------
1           xxx    999
2    xxx    kkk
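The following operations need to be performed on these dataframes:
- Merge the two dataframes
- Replace only the blank (NA) cells in old_DF with the corresponding values from new_DF
- Report the cells where the two dataframes hold contradicting values in a new dataframe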
Expected results:
updated_df
id   col1   col2   col3
-------------------------
1    aaa    xxx    999
2    xxx    bbb    123
conflicts_df
id   col1   col2   col3
-------------------------
2           bbb
2           kkk
I can use the .append() method to join the two dataframes, and I guess one can use the .bfill() or .ffill() methods to fill in the missing values, but I have been unsuccessful with both. I have tried df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates() but I do not get the desired results. Additionally, I do not understand how to perform step 3 above. Can anyone help with this problem?
Recommended answer
Setup:
import pandas as pd

# Blank cells from the question are represented as pd.NA
old_df = pd.DataFrame([
    [1, 'aaa', pd.NA, pd.NA],
    [2, pd.NA, 'bbb', 123]],
    columns=['id', 'col1', 'col2', 'col3'])
new_df = pd.DataFrame([
    [1, pd.NA, 'xxx', 999],
    [2, 'xxx', 'kkk', pd.NA]],
    columns=['id', 'col1', 'col2', 'col3'])
Use combine_first to get updated_df, setting id as the index in both frames so that rows are aligned by id. Values already present in old_df are kept, and only its missing cells are filled from new_df:
old_df = old_df.set_index('id')
new_df = new_df.set_index('id')
updated_df = old_df.combine_first(new_df)
# updated_df outputs (reset the id with reset_index() if necessary):
#     col1 col2 col3
# id
# 1    aaa  xxx  999
# 2    xxx  bbb  123
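For convenience, the fill step can be run on its own as in the sketch below. It rebuilds the question's two frames (using np.nan for the blank cells, which is an assumption made here for illustration rather than the pd.NA used in the setup) and turns id back into an ordinary column afterwards.

```python
import numpy as np
import pandas as pd

# Rebuild the example frames; blank cells are np.nan (an assumption for this sketch).
old_df = pd.DataFrame(
    {"id": [1, 2], "col1": ["aaa", np.nan], "col2": [np.nan, "bbb"], "col3": [np.nan, 123]}
).set_index("id")
new_df = pd.DataFrame(
    {"id": [1, 2], "col1": [np.nan, "xxx"], "col2": ["xxx", "kkk"], "col3": [999, np.nan]}
).set_index("id")

# Values in old_df win; only its missing cells are taken from new_df.
updated_df = old_df.combine_first(new_df).reset_index()
print(updated_df)
```

Note that the conflicting cell (id 2, col2) keeps the old value 'bbb', because combine_first never overwrites a non-missing value in the calling frame.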
Generate a boolean mask dataframe that is True where both the old and new frames have a value in a given cell and those values differ. Then pick the masked cells from both old and new, and drop the rows in which no cell conflicts:
mask = pd.notnull(new_df) & ~old_df.eq(new_df) & pd.notnull(old_df)
conflicts_df = pd.concat([old_df[mask], new_df[mask]]).dropna(how='all')
# conflicts_df outputs:
#    col1 col2 col3
# id
# 2   NaN  bbb  NaN
# 2   NaN  kkk  NaN
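If a side-by-side report is easier to read than two stacked rows, the same mask can be turned into one row per conflicting cell. This is only a sketch of an alternative layout, not part of the original answer: the name conflicts_long and its column names are made up for illustration, blank cells are np.nan, and both frames are assumed to share the same index and columns.

```python
import numpy as np
import pandas as pd

# Rebuild the example frames; blank cells are np.nan (an assumption for this sketch).
old_df = pd.DataFrame(
    {"id": [1, 2], "col1": ["aaa", np.nan], "col2": [np.nan, "bbb"], "col3": [np.nan, 123]}
).set_index("id")
new_df = pd.DataFrame(
    {"id": [1, 2], "col1": [np.nan, "xxx"], "col2": ["xxx", "kkk"], "col3": [999, np.nan]}
).set_index("id")

# A cell conflicts when both frames hold a value and the values differ.
mask = old_df.notna() & new_df.notna() & old_df.ne(new_df)

# Positions (row, column) of every conflicting cell.
rows, cols = np.where(mask.to_numpy())

# Hypothetical long layout: one row per conflict, old and new values side by side.
conflicts_long = pd.DataFrame({
    "id": mask.index[rows],
    "column": mask.columns[cols],
    "old": old_df.to_numpy()[rows, cols],
    "new": new_df.to_numpy()[rows, cols],
})
print(conflicts_long)
```

With the example data this reports a single conflict: id 2, column col2, where the old value is 'bbb' and the new value is 'kkk'.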