How to join two dataframes by looking up values of a column whose values have been transformed into different columns?

Problem Description

Suppose I have df1 like the following,

{'column1': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC', 'CCC', 'CCC', 'DDD', 'DDD', 'AAA', 'AAA', 'AAA', 'EEE', 'EEE', 'FFF', 'GGG', 'FFF', 'GGG', 'BBB', 'BBB', 'BBB'], 'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', nan, nan, 'C', 'D', 'D', 'D', 'D', 'D', 'D'], 'column3': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'DD', 'DD', 'AA', 'AA', 'AA', 'CC', 'EE', 'EE', 'FF', 'FF', 'FF'], 'column4': ['Jan-20', 'Jan-21', 'Feb-21', 'Feb-20', 'Mar-21', 'Apr-21', 'Jul-21', 'Jul-20', 'Aug-21', 'Sep-21', 'Feb-21', 'Mar-21', 'Oct-21', 'Nov-21', 'Jan-21', 'Apr-21', 'Dec-21', 'Dec-21', 'Dec-20', 'Aug-21', 'Sep-21', 'Feb-21'], 'column5': [123, 214, 32, 12, 23, 2, 43, 14, 2, 113, 5, 6, 324, 13, 1, 123, 3245, 123, 5566, 132, 14, 21]}

On the basis of column1, column2, column3, and column4 (column4, which is now transformed into columns), I want to populate column5.

The df2 into which I'd want the data should look like this,

{'column1': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB'], 'column1.5': ['V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'], 'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], 'column3': ['GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ'], 'Jan-21': [nan, nan, 214.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 5.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 21.0, nan, nan, nan, nan, 32.0, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 6.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'May-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
nan, nan, nan, nan, nan, nan, nan, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Jul-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Aug-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 132.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Sep-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 14.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Oct-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Nov-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Dec-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]}

df2 already exists in the above format. I would just need to update the month columns and keep the current year's data.

How may I approach this problem?

If there is no match, I'd want to join the two dataframes while keeping the contents of df2 intact.
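To make the relationship between the two shapes concrete, here is a minimal, self-contained sketch (toy data, not the frames above) of how a long-form column4 pivots out into month columns:

```python
import pandas as pd

# toy long-form frame in df1's shape: column4 holds the month labels
long_df = pd.DataFrame({'column1': ['AAA', 'AAA', 'BBB'],
                        'column4': ['Jan-21', 'Feb-21', 'Feb-21'],
                        'column5': [214, 5, 32]})

# pivoting turns each distinct column4 value into its own column, as in df2
wide_df = long_df.pivot(index='column1', columns='column4', values='column5')
print(wide_df)
```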

Recommended Answer

Maybe you need to melt df2 so that it has the same shape as df1?

import pandas as pd
import numpy as np

nan = np.nan  # np.NAN is a deprecated alias; np.nan is the canonical spelling

df1 = pd.DataFrame(
    {'column1': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC', 
                 'CCC', 'CCC', 'DDD', 'DDD', 'AAA', 'AAA', 
                 'AAA', 'EEE', 'EEE', 'FFF', 'GGG', 'FFF', 
                 'GGG', 'BBB', 'BBB', 'BBB'], 
     'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 
                 'B', 'B', 'B', 'C', nan, nan, 'C', 'D', 'D', 
                 'D', 'D', 'D', 'D'], 
     'column3': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 
                 'CC', 'CC', 'CC', 'DD', 'DD', 'AA', 'AA', 'AA', 
                 'CC', 'EE', 'EE', 'FF', 'FF', 'FF'], 
     'column4': ['Jan-20', 'Jan-21', 'Feb-21', 'Feb-20', 'Mar-21', 
                 'Apr-21', 'Jul-21', 'Jul-20', 'Aug-21', 'Sep-21', 
                 'Feb-21', 'Mar-21', 'Oct-21', 'Nov-21', 'Jan-21', 
                 'Apr-21', 'Dec-21', 'Dec-21', 'Dec-20', 'Aug-21', 
                 'Sep-21', 'Feb-21'], 
     'column5': [123, 214, 32, 12, 23, 2, 43, 14, 2, 113, 5, 6, 324, 
                 13, 1, 123, 3245, 123, 5566, 132, 14, 21]}
)

df2 = pd.DataFrame(
    {'column1': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 
                 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 
                 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'BBB', 
                 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 
                 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 
                 'BBB', 'BBB', 'BBB', 'BBB', 'BBB'], 
     'column1.5': ['V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 
                   'V', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 
                   'W', 'W', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 
                   'V', 'V', 'V', 'X', 'X', 'X', 'X', 'X', 'X', 
                   'X', 'X', 'X', 'X'], 
     'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 
                 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 
                 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 
                 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], 
     'column3': ['GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 
                 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 
                 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 
                 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 
                 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ'], 
     'Jan-21': [nan, nan, 214.0, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, 5.0, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, 21.0, nan, nan, 
                nan, nan, 32.0, nan, nan, nan, nan, nan, nan, nan], 
     'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, 6.0, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'May-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Jul-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Aug-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, 132.0, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Sep-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, 14.0, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Oct-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Nov-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 
     'Dec-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
                nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]}
)

As you know, df2 is a pivot table of df1, and you can transform it back.

melted = df2.melt(
    id_vars=['column1', 'column2', 'column3'],
    value_vars=[i for i in df2.columns if '-' in i],  # the month columns
    var_name='column4', value_name='column5'
)
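As a self-contained illustration of what this melt does (toy data; the dropna is an optional extra step to keep only the populated cells):

```python
import pandas as pd
import numpy as np

# toy wide frame in df2's shape: one column per month
wide = pd.DataFrame({'column1': ['AAA', 'BBB'],
                     'Jan-21': [214.0, np.nan],
                     'Feb-21': [np.nan, 32.0]})

# melt gathers the month columns back into a single column4/column5 pair
long_form = wide.melt(id_vars=['column1'],
                      value_vars=['Jan-21', 'Feb-21'],
                      var_name='column4', value_name='column5')
long_form = long_form.dropna(subset=['column5'])  # keep only cells that held a value
print(long_form)
```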

Your data may contain different values in column5 for the same keys. Which one is right?

merged = df1.merge(melted, on=['column1', 'column2', 'column3', 'column4'], how='outer')
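If you want to see which rows actually matched, merge's indicator parameter adds a _merge column; a small sketch with toy frames (not the question's data):

```python
import pandas as pd

left = pd.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})
right = pd.DataFrame({'k': ['b', 'c'], 'v': [20, 30]})

# indicator=True adds a _merge column: left_only, right_only, or both
m = left.merge(right, on='k', how='outer', indicator=True)
print(m[['k', '_merge']])
```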

As far as I can see, that is not the case (the check below returns an empty dataframe):

merged[(merged.column5_x != merged.column5_y) # data in column5_x and column5_y is different
       &(merged.column5_x.notnull()) # and not NaN
       &(merged.column5_y.notnull()) # and not NaN
      ]

I don't think this is the proper way to drop the duplicates, but:

merged.assign(
    # row-wise coalesce: take column5_x and fall back to column5_y where it is NaN
    # (the original `if not x.column5_x.empty` tested the whole Series, not each row)
    column_5=lambda x: x.column5_x.fillna(x.column5_y)) \
    .drop(columns=['column5_x', 'column5_y'])

Another version:

(merged
    .set_index(['column1', 'column2', 'column3', 'column4'])  # move the key columns to the index
    .stack()                           # stack column5_x and column5_y into a single column
    .reset_index()                     # bring it back to the initial form
    .drop(columns='level_4')           # drop the level that held the names column5_x / column5_y
    .drop_duplicates()                 # the deduplication you want
    .rename(columns={0: 'column_5'}))  # final step
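Since the question also asks to keep df2's existing contents where there is no match, a wide-side alternative may fit better: pivot df1 into df2's shape and let DataFrame.update overwrite only the cells where the pivoted frame is non-NaN. A minimal sketch under that assumption (toy frames with a single key column, not the question's data):

```python
import pandas as pd
import numpy as np

# long-form updates (df1's shape) and an existing wide table (df2's shape)
long_updates = pd.DataFrame({'column1': ['AAA', 'BBB'],
                             'column4': ['Jan-21', 'Feb-21'],
                             'column5': [214, 32]})
wide_existing = pd.DataFrame({'column1': ['AAA', 'BBB'],
                              'Jan-21': [np.nan, np.nan],
                              'Feb-21': [np.nan, 99.0]}).set_index('column1')

# pivot the updates into the wide shape, then write only non-NaN cells in place
wide_updates = long_updates.pivot(index='column1', columns='column4', values='column5')
wide_existing.update(wide_updates)  # cells absent from the updates keep their old values
print(wide_existing)
```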
