Merging two pandas dataframes results in "duplicate" columns
Problem description
I'm trying to merge two dataframes that share the same key column. Some of the other columns also have identical headers, although the dataframes do not have the same number of rows. After merging, these columns are "duplicated": the original headers come back with a suffix (_x, _y, etc.).
Does anyone know how to get pandas to drop the duplicate columns in the example below?
This is my Python code:
import pandas as pd
holding_df = pd.read_csv('holding.csv')
invest_df = pd.read_csv('invest.csv')
merge_df = pd.merge(holding_df, invest_df, on='key', how='left').fillna(0)
merge_df.to_csv('merged.csv', index=False)
The CSV files contain this:
First rows of the left dataframe (holding_df):
key, dept_name, res_name, year, need, holding
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1
...
Right dataframe (invest_df):
key, dept_name, res_name, year, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1000000
DeptA_ResB_2015, DeptA, ResB, 2015, 2, 6000000
DeptB_ResB_2015, DeptB, ResB, 2015, 1, 6000000
...
Merge result:
key, dept_name_x, res_name_x, year_x, need, holding, dept_name_y, res_name_y, year_y, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1, DeptA, ResA, 2015.0, 1.0, 1000000.0
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2018, DeptA, ResA, 2018, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2019, DeptA, ResA, 2019, 1, 1, 0, 0, 0.0, 0.0, 0.0
...
Recommended answer
The additional columns with the suffixes '_x' and '_y' appear because both dataframes contain columns with the same names that are not part of the merge key. pandas keeps both copies and adds the suffixes to tell them apart, whether or not the values match. In that case you need to drop the additional '_y' columns and rename the '_x' columns:
In [145]:
# define our drop function
def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)

# `merged` is the result of the merge (merge_df in the question)
drop_y(merged)
merged
Out[145]:
key dept_name_x res_name_x year_x need holding \
0 DeptA_ResA_2015 DeptA ResA 2015 1 1
1 DeptA_ResA_2016 DeptA ResA 2016 1 1
2 DeptA_ResA_2017 DeptA ResA 2017 1 1
no_of_inv inv_cost_wo_ice
0 1 1000000
1 0 0
2 0 0
In [146]:
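# func to rename '_x' cols
def rename_x(df):
    # cut off the literal '_x' suffix; rstrip('_x') would strip any trailing '_' or 'x' characters
    renamed = {col: col[:-2] for col in df if col.endswith('_x')}
    df.rename(columns=renamed, inplace=True)

rename_x(merged)
merged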
Out[146]:
key dept_name res_name year need holding no_of_inv \
0 DeptA_ResA_2015 DeptA ResA 2015 1 1 1
1 DeptA_ResA_2016 DeptA ResA 2016 1 1 0
2 DeptA_ResA_2017 DeptA ResA 2017 1 1 0
inv_cost_wo_ice
0 1000000
1 0
2 0
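As an alternative to dropping and renaming after the merge, the overlapping columns can be left out of the right dataframe before merging, so that no suffixed columns are created in the first place. The following is a minimal sketch, not part of the original answer: the inline dataframes stand in for holding.csv and invest.csv using the rows shown in the question, and `new_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Stand-ins for holding.csv and invest.csv (rows copied from the question).
holding_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015", "DeptA_ResA_2016", "DeptA_ResA_2017"],
    "dept_name": ["DeptA", "DeptA", "DeptA"],
    "res_name": ["ResA", "ResA", "ResA"],
    "year": [2015, 2016, 2017],
    "need": [1, 1, 1],
    "holding": [1, 1, 1],
})
invest_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015", "DeptA_ResB_2015", "DeptB_ResB_2015"],
    "dept_name": ["DeptA", "DeptA", "DeptB"],
    "res_name": ["ResA", "ResB", "ResB"],
    "year": [2015, 2015, 2015],
    "no_of_inv": [1, 2, 1],
    "inv_cost_wo_ice": [1000000, 6000000, 6000000],
})

# Columns that exist only in the right frame (hypothetical helper name).
new_cols = [c for c in invest_df.columns if c not in holding_df.columns]

# Merge only the key plus the new columns, so no header clashes occur.
merge_df = pd.merge(
    holding_df, invest_df[["key"] + new_cols], on="key", how="left"
).fillna(0)

print(merge_df.columns.tolist())
```

With this approach the left dataframe's copies of dept_name, res_name and year are the ones that survive, which is the same outcome as dropping the '_y' columns.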
EDIT: If you add the common columns to the merge keys, the merge no longer produces the duplicated columns, because columns listed in on are never suffixed. Rows are then matched only where all of the listed columns agree:
merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)
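To check this behaviour without the CSV files, here is a small sketch under stated assumptions: the two inline dataframes are cut-down stand-ins for the question's data, and `shared` is a hypothetical name for the list of headers present in both frames, computed rather than typed out.

```python
import pandas as pd

# Cut-down stand-ins for the two dataframes in the question.
holding_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015", "DeptA_ResA_2016"],
    "dept_name": ["DeptA", "DeptA"],
    "res_name": ["ResA", "ResA"],
    "year": [2015, 2016],
    "holding": [1, 1],
})
invest_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015"],
    "dept_name": ["DeptA"],
    "res_name": ["ResA"],
    "year": [2015],
    "no_of_inv": [1],
})

# Every header that appears in both frames becomes a merge key.
shared = [c for c in holding_df.columns if c in invest_df.columns]

merge_df = pd.merge(holding_df, invest_df, on=shared, how="left").fillna(0)

print(shared)
print(merge_df.columns.tolist())
```

One thing to keep in mind with this design: a row that agrees on key but differs in, say, year is treated as unmatched, so its right-hand columns end up filled with 0.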