How to join two dataframes by looking up values of a column whose values are transformed into different columns?
Question
Suppose I have df1 like the following,
{'column1': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC', 'CCC', 'CCC', 'DDD', 'DDD', 'AAA', 'AAA', 'AAA', 'EEE', 'EEE', 'FFF', 'GGG', 'FFF', 'GGG', 'BBB', 'BBB', 'BBB'], 'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', nan, nan, 'C', 'D', 'D', 'D', 'D', 'D', 'D'], 'column3': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'CC', 'CC', 'CC', 'DD', 'DD', 'AA', 'AA', 'AA', 'CC', 'EE', 'EE', 'FF', 'FF', 'FF'], 'column4': ['Jan-20', 'Jan-21', 'Feb-21', 'Feb-20', 'Mar-21', 'Apr-21', 'Jul-21', 'Jul-20', 'Aug-21', 'Sep-21', 'Feb-21', 'Mar-21', 'Oct-21', 'Nov-21', 'Jan-21', 'Apr-21', 'Dec-21', 'Dec-21', 'Dec-20', 'Aug-21', 'Sep-21', 'Feb-21'], 'column5': [123, 214, 32, 12, 23, 2, 43, 14, 2, 113, 5, 6, 324, 13, 1, 123, 3245, 123, 5566, 132, 14, 21]}
On the basis of column1, column2, column3, and column4 (whose values are now spread across columns), I want to populate column5.
The df2 into which I want this data should look like this,
{'column1': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB'], 'column1.5': ['V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'], 'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], 'column3': ['GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ'], 'Jan-21': [nan, nan, 214.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 5.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 21.0, nan, nan, nan, nan, 32.0, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 6.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'May-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
nan, nan, nan, nan, nan, nan, nan, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Jul-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Aug-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 132.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Sep-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 14.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Oct-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Nov-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Dec-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]}
df2 already exists in the above format. I would just need to update the month columns and keep the current year's data.
How may I approach this problem?
If there is no match, I'd like to join the two dataframes while keeping the contents of df2 intact.
Answer
Maybe you need to melt df2 so that it has the same shape as df1?
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame(
{'column1': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC',
'CCC', 'CCC', 'DDD', 'DDD', 'AAA', 'AAA',
'AAA', 'EEE', 'EEE', 'FFF', 'GGG', 'FFF',
'GGG', 'BBB', 'BBB', 'BBB'],
'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
'B', 'B', 'B', 'C', nan, nan, 'C', 'D', 'D',
'D', 'D', 'D', 'D'],
'column3': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB',
'CC', 'CC', 'CC', 'DD', 'DD', 'AA', 'AA', 'AA',
'CC', 'EE', 'EE', 'FF', 'FF', 'FF'],
'column4': ['Jan-20', 'Jan-21', 'Feb-21', 'Feb-20', 'Mar-21',
'Apr-21', 'Jul-21', 'Jul-20', 'Aug-21', 'Sep-21',
'Feb-21', 'Mar-21', 'Oct-21', 'Nov-21', 'Jan-21',
'Apr-21', 'Dec-21', 'Dec-21', 'Dec-20', 'Aug-21',
'Sep-21', 'Feb-21'],
'column5': [123, 214, 32, 12, 23, 2, 43, 14, 2, 113, 5, 6, 324,
13, 1, 123, 3245, 123, 5566, 132, 14, 21]}
)
df2 = pd.DataFrame(
{'column1': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA',
'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA',
'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'BBB',
'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB',
'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB', 'BBB',
'BBB', 'BBB', 'BBB', 'BBB', 'BBB'],
'column1.5': ['V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V',
'V', 'W', 'W', 'W', 'W', 'W', 'W', 'W', 'W',
'W', 'W', 'V', 'V', 'V', 'V', 'V', 'V', 'V',
'V', 'V', 'V', 'X', 'X', 'X', 'X', 'X', 'X',
'X', 'X', 'X', 'X'],
'column2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D',
'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
'column3': ['GG', 'HH', 'AA', 'BB', 'CC', 'DD', 'EE', 'FF',
'II', 'JJ', 'GG', 'HH', 'AA', 'BB', 'CC', 'DD',
'EE', 'FF', 'II', 'JJ', 'GG', 'HH', 'AA', 'BB',
'CC', 'DD', 'EE', 'FF', 'II', 'JJ', 'GG', 'HH',
'AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'II', 'JJ'],
'Jan-21': [nan, nan, 214.0, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, 5.0, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, 21.0, nan, nan,
nan, nan, 32.0, nan, nan, nan, nan, nan, nan, nan],
'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, 6.0, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'May-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Jul-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Aug-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, 132.0, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Sep-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, 14.0, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Oct-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Nov-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
'Dec-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]}
)
As you can see, df2 is a pivot table of df1, so you can transform it back.
melted = df2.melt(
id_vars=['column1', 'column2', 'column3'],
value_vars=[i for i in df2.columns if '-' in i],
var_name='column4', value_name='column5'
)
Your data may contain different values in column5 for the same keys. Which one is right?
merged = df1.merge(melted, on=['column1', 'column2', 'column3', 'column4'], how='outer')
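When auditing an outer merge like this one, pandas also offers the `indicator=True` flag, which tags each row with the side it came from; a small sketch on hypothetical toy frames:

```python
import pandas as pd

left = pd.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})
right = pd.DataFrame({'k': ['b', 'c'], 'v': [2, 3]})

# indicator=True adds a `_merge` column with values
# 'left_only', 'both', or 'right_only' for each row,
# useful for seeing which keys actually matched.
merged = left.merge(right, on='k', how='outer', indicator=True)
print(merged['_merge'].tolist())
```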
As far as I can see, that is not the case (the resulting dataframe is empty):
merged[(merged.column5_x != merged.column5_y) # data in column5_x and column5_y is different
&(merged.column5_x.notnull()) # and not NaN
&(merged.column5_y.notnull()) # and not NaN
]
I don't think this is the cleanest way to drop the duplicates, but:
merged.assign(
    column5=lambda x: x.column5_x.fillna(x.column5_y)  # row-wise: prefer column5_x, fall back to column5_y
).drop(columns=['column5_x', 'column5_y'])
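The per-row "take column5_x if present, else column5_y" coalesce can be isolated and checked with Series.fillna on two hypothetical series (note that a Series-level test like `.empty` would not work here, since it checks the whole Series rather than each row):

```python
import numpy as np
import pandas as pd

s_x = pd.Series([1.0, np.nan, 3.0])
s_y = pd.Series([np.nan, 2.0, 30.0])

# fillna works element-wise: keep s_x where it has a value,
# take s_y only where s_x is NaN.
combined = s_x.fillna(s_y)
print(combined.tolist())
```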
Another version:
(merged
.set_index(['column1', 'column2', 'column3', 'column4']) # move columns to index
.stack() # move column5_x and column5_y to one column
.reset_index() # bring to the initial form
.drop(columns='level_4') # drop column with names: column5_x, column5_y
.drop_duplicates() # what you want
.rename(columns={0: 'column5'})) # final step