Multiple operations on two dataframes using pandas

Views: 58 | Posted: 2020/10/17 2:48:57 | Tags: python, pandas, dataframe

Question

This is an extension of my previous question.

I have two dataframes df1 and df2 of different lengths and two columns as key columns. I would like to perform multiple operations on these dataframes as follows:

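1. Replace only the blank (NA) cells in df1 with the corresponding values from df2, matched on each pair of key columns.

2. Report, in a new dataframe, the cells whose values conflict between the two dataframes.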

df1

id_col1   id_col2   name    age    sex
---------------------------------------
101         1M              21  
101         3M              21      M
102         1M      Mark    25

df2

id_col1    id_col2    name     age     sex
-------------------------------------------
101          1M       Steve             M
101          2M                         M
101          3M       Steve    25   
102          1M       Ria      25       M
102          2M       Anie     22       F

After performing operation 1, i.e. replacing NA's in df1 with the corresponding values from df2, I should get the following:

id_col1    id_col2    name     age     sex
-------------------------------------------
101         1M        Steve    21      M
101         3M        Steve    21      M
102         1M        Mark     25      M

After performing operation 2, i.e. reporting the cells that conflict between df1 and df2 for the same key columns, I should get the following:

id_col1    id_col2    name     age     sex
-------------------------------------------
101          3M                21   
101          3M                25   
102          1M        Mark     
102          1M        Ria

Can anyone help in solving these?

Recommended answer

The approach I would take is very similar to the answer to the original problem.

Set the id columns as the index and use combine_first, as in the original post. However, since combine_first returns the union of both dataframes' labels (i.e. rows and columns), select only the rows whose index belongs to df1 after applying it:

idx = ['id_col1', 'id_col2']
df1 = df1.set_index(idx)
df2 = df2.set_index(idx)
result_1 = df1.combine_first(df2).loc[df1.index]
# result_1 outputs:
                  name   age sex
id_col1 id_col2
101     1M       Steve  21.0   M
        3M       Steve  21.0   M
102     1M        Mark  25.0   M
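
For readers who want to run this step end to end, here is a minimal sketch. The construction of the two frames is rebuilt from the tables in the question and is not part of the original answer; it assumes the blank cells are real missing values (NaN/None) rather than empty strings, and the names left and right are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question's tables (assumption: blanks are NaN/None).
df1 = pd.DataFrame({
    "id_col1": [101, 101, 102],
    "id_col2": ["1M", "3M", "1M"],
    "name": [None, None, "Mark"],
    "age": [21.0, 21.0, 25.0],
    "sex": [None, "M", None],
})
df2 = pd.DataFrame({
    "id_col1": [101, 101, 101, 102, 102],
    "id_col2": ["1M", "2M", "3M", "1M", "2M"],
    "name": ["Steve", None, "Steve", "Ria", "Anie"],
    "age": [np.nan, np.nan, 25.0, 25.0, 22.0],
    "sex": ["M", "M", None, "M", "F"],
})

idx = ["id_col1", "id_col2"]
left = df1.set_index(idx)
right = df2.set_index(idx)

# combine_first keeps left's non-null values and fills its nulls from right;
# .loc[left.index] then drops the rows that exist only in right.
result_1 = left.combine_first(right).loc[left.index]
print(result_1)
```

If the blanks in your real data are empty strings, they would need converting to NaN first (for example with replace), since combine_first only fills missing values.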

To generate result_2, first follow the approach as suggested in the earlier post:

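# Keep cells where both frames hold a value and the two values differ
mask = pd.notnull(df1) & ~df1.eq(df2) & pd.notnull(df2)
# Stack the conflicting values from each frame, dropping rows with no conflict
result_2 = pd.concat([df1[mask], df2[mask]]).dropna(how='all')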

This generates the desired data, but the row order differs slightly from what you show, because df1[mask] is stacked on top of df2[mask].

Sort by index to get the final result:

result_2.sort_index()
# outputs 
                 name   age  sex
id_col1 id_col2
101     3M        NaN  21.0  NaN
        3M        NaN  25.0  NaN
102     1M       Mark   NaN  NaN
        1M        Ria   NaN  NaN
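
The conflict step can also be written so that both frames share the same row labels before they are compared, which avoids depending on how pandas aligns differently indexed frames inside & and eq. The sketch below is an alternative formulation under that assumption, not the original answer's code. The sample data is rebuilt from the question's tables, and right_aligned is an illustrative name.

```python
import numpy as np
import pandas as pd

idx = ["id_col1", "id_col2"]

# Sample frames rebuilt from the question's tables (assumption: blanks are NaN/None).
df1 = pd.DataFrame({
    "id_col1": [101, 101, 102],
    "id_col2": ["1M", "3M", "1M"],
    "name": [None, None, "Mark"],
    "age": [21.0, 21.0, 25.0],
    "sex": [None, "M", None],
}).set_index(idx)
df2 = pd.DataFrame({
    "id_col1": [101, 101, 101, 102, 102],
    "id_col2": ["1M", "2M", "3M", "1M", "2M"],
    "name": ["Steve", None, "Steve", "Ria", "Anie"],
    "age": [np.nan, np.nan, 25.0, 25.0, 22.0],
    "sex": ["M", "M", None, "M", "F"],
}).set_index(idx)

# Restrict df2 to the key pairs present in df1 so both frames carry identical labels.
right_aligned = df2.reindex(df1.index)

# A cell conflicts when both sides hold a value and the values differ.
mask = df1.notna() & right_aligned.notna() & df1.ne(right_aligned)

# Stack each side's conflicting values, drop rows with no conflict, group by key.
result_2 = (
    pd.concat([df1[mask], right_aligned[mask]])
    .dropna(how="all")
    .sort_index()
)
print(result_2)
```

Reindexing first keeps the mask the same shape as both frames, so each boolean selection is applied to a frame with matching labels.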

The only differences between this solution and the one presented in the first post are the extra .loc[df1.index] and result_2.sort_index().
