Merge and update dataframes based on a subset of their columns
Question
Is there a faster way to replace the two for loops below, assuming the dataframes are very large? In my real case, each dataframe has 200 rows and 25 columns.
import numpy as np
import pandas as pd

# Note: np.array on mixed str/int rows stores every value as a string.
data_df1 = np.array([['Name','Unit','Attribute','Date'],['a','A',1,2014],['b','B',2,2015],['c','C',3,2016],
                     ['d','D',4,2017],['e','E',5,2018]])
data_df2 = np.array([['Name','Unit','Date'],['a','F',2019],['b','G',2020],['e','H',2021],
                     ['f','I',2022]])
df1 = pd.DataFrame(data=data_df1)
print('df1:')
print(df1)
df2 = pd.DataFrame(data=data_df2)
print('df2:')
print(df2)
# Rows/columns of df1 to overwrite, and the matching rows/columns of df2
row_df1 = [1,2,5]
col_df1 = [1,3]
row_df2 = [1,2,3]
col_df2 = [1,2]
for i in range(len(row_df1)):
    for j in range(len(col_df1)):
        # DataFrame.set_value was removed in pandas 1.0; .at is the scalar accessor
        df1.at[row_df1[i], col_df1[j]] = df2.at[row_df2[i], col_df2[j]]
print('df1 after operation:')
print(df1)
Expected output:
df1:
0 1 2 3
0 Name Unit Attribute Date
1 a A 1 2014
2 b B 2 2015
3 c C 3 2016
4 d D 4 2017
5 e E 5 2018
df2:
0 1 2
0 Name Unit Date
1 a F 2019
2 b G 2020
3 e H 2021
4 f I 2022
df1 after operation:
0 1 2 3
0 Name Unit Attribute Date
1 a F 1 2019
2 b G 2 2020
3 c C 3 2016
4 d D 4 2017
5 e H 5 2021
I tried:
df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]]
print('df1:')
print(df1)
print('df2:')
print(df2)
But the outcome is the following, with unexpected NaN values:
df1:
0 1 2 3
0 Name Unit Attribute Date
1 a F 1 NaN
2 b G 2 NaN
3 c C 3 2016
4 d D 4 2017
5 e NaN 5 NaN
df2:
0 1 2
0 Name Unit Date
1 a F 2019
2 b G 2020
3 e H 2021
4 f I 2022
Thanks in advance for your help.
Recommended answer
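A likely explanation for the NaN values in the attempt: when the right-hand side of a .loc assignment is itself a DataFrame, pandas aligns it to the target by row and column labels rather than by position. Column 3 and row 5 have no counterpart in the selected slice of df2, so those cells come out as NaN. The sketch below reproduces this on small stand-in frames (left and right are illustrative names, not from the question) and shows that assigning a plain array copies by position instead. It assumes a positional copy is what is wanted.

```python
import pandas as pd

# Stand-ins for the question's frames, with integer row and column labels.
left = pd.DataFrame([["a", "A", 1, 2014],
                     ["b", "B", 2, 2015],
                     ["e", "E", 5, 2018]], index=[1, 2, 5], dtype=object)
right = pd.DataFrame([["a", "F", 2019],
                      ["b", "G", 2020],
                      ["e", "H", 2021]], index=[1, 2, 3], dtype=object)

# DataFrame on the right-hand side: values are matched by label,
# so cells without a matching row/column label become NaN.
aligned = left.copy()
aligned.loc[[1, 2, 5], [1, 3]] = right.loc[[1, 2, 3], [1, 2]]

# Plain array on the right-hand side: values are copied by position.
positional = left.copy()
positional.loc[[1, 2, 5], [1, 3]] = right.loc[[1, 2, 3], [1, 2]].to_numpy()

print(aligned)
print(positional)
```

Positional assignment replaces both loops, but it relies on the row and column lists being in matching order. Matching on a key column, as in the rest of this answer, does not.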
Some cleanup first:
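def clean_df(df):
    # Promote the first row to column headers
    df.columns = df.iloc[0]
    df.columns.name = None
    # Drop the header row; the old row labels are kept in an 'index' column
    df = df.iloc[1:].reset_index()
    return df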
df1 = clean_df(df1)
df1
index Name Unit Attribute Date
0 1 a A 1 2014
1 2 b B 2 2015
2 3 c C 3 2016
3 4 d D 4 2017
4 5 e E 5 2018
df2 = clean_df(df2)
df2
index Name Unit Date
0 1 a F 2019
1 2 b G 2020
2 3 e H 2021
3 4 f I 2022
Use merge, specifying on='Name', so that rows are matched on Name only and the other columns are not used as join keys.
cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
df1 = df1.merge(df2, how='left', on='Name')[cols]\
.rename(columns=lambda x: x.split('_')[0]).fillna(df1)
df1
Name Unit Attribute Date
0 a F 1 2019
1 b G 2 2020
2 c C 3 2016
3 d D 4 2017
4 e H 5 2021
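As an alternative to the merge above, DataFrame.update can overwrite matching cells in place once both frames are indexed by the key column. This is a minimal sketch, not part of the original answer. It assumes Name is unique in both frames, and it builds the frames directly with headers instead of using clean_df. The variable name updated is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": list("abcde"),
                    "Unit": list("ABCDE"),
                    "Attribute": [1, 2, 3, 4, 5],
                    "Date": [2014, 2015, 2016, 2017, 2018]})
df2 = pd.DataFrame({"Name": list("abef"),
                    "Unit": list("FGHI"),
                    "Date": [2019, 2020, 2021, 2022]})

# Index both frames by the key so update() can align rows on Name.
updated = df1.set_index("Name")
# update() works in place and only touches labels already present in
# `updated`: row 'f' is ignored and the Attribute column is left alone.
updated.update(df2.set_index("Name"))
updated = updated.reset_index()
print(updated)
```

Depending on the pandas version, the updated Date column may come back as float rather than int. Cast it back with astype if the dtype matters.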