Merge and update dataframes based on a subset of their columns


Question

Is there a faster way to replace the two for loops below, assuming the dataframes are very large? In my real case, each dataframe has 200 rows and 25 columns.

import numpy as np
import pandas as pd

# Note: np.array stores mixed str/int input as strings, and the header row is kept as data (row 0).
data_df1 = np.array([['Name','Unit','Attribute','Date'],['a','A',1,2014],['b','B',2,2015],['c','C',3,2016],
                 ['d','D',4,2017],['e','E',5,2018]])
data_df2 = np.array([['Name','Unit','Date'],['a','F',2019],['b','G',2020],['e','H',2021],
                 ['f','I',2022]])
df1 = pd.DataFrame(data=data_df1)
print('df1:')
print(df1)
df2 = pd.DataFrame(data=data_df2)
print('df2:')
print(df2)
# Row/column labels to overwrite in df1, and the matching source labels in df2.
row_df1 = [1,2,5]
col_df1 = [1,3]
row_df2 = [1,2,3]
col_df2 = [1,2]
for i in range(len(row_df1)):
    for j in range(len(col_df1)):
        # DataFrame.set_value was removed in pandas 1.0; .at is the scalar accessor.
        df1.at[row_df1[i], col_df1[j]] = df2.at[row_df2[i], col_df2[j]]
print('df1 after operation:')
print(df1)

Expected output:

df1:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     A          1  2014
2     b     B          2  2015
3     c     C          3  2016
4     d     D          4  2017
5     e     E          5  2018
df2:
      0     1     2
0  Name  Unit  Date
1     a     F  2019
2     b     G  2020
3     e     H  2021
4     f     I  2022
df1 after operation:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     F          1  2019
2     b     G          2  2020
3     c     C          3  2016
4     d     D          4  2017
5     e     H          5  2021

I tried:

df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]]
print('df1:')
print(df1)
print('df2:')
print(df2)

But the outcome is the following. There are unexpected NaN values.

df1:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     F          1   NaN
2     b     G          2   NaN
3     c     C          3  2016
4     d     D          4  2017
5     e   NaN          5   NaN
df2:
      0     1     2
0  Name  Unit  Date
1     a     F  2019
2     b     G  2020
3     e     H  2021
4     f     I  2022

Thanks in advance for your help.

Recommended answer
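
A note on the NaN values in the attempt above: when the right-hand side of a .loc assignment is a DataFrame, pandas aligns it on row and column labels rather than on position. Here the source block carries row labels 1, 2, 3 and column labels 1, 2, while the target uses rows 1, 2, 5 and columns 1, 3, so only the cells (1, 1) and (2, 1) find a match and the rest become NaN. The sketch below is not part of the original answer; it assumes the same frames as in the question and strips the labels with to_numpy() so that the block is written by position.

```python
import numpy as np
import pandas as pd

# Same layout as in the question: header row stored as data, integer labels.
df1 = pd.DataFrame(np.array([['Name', 'Unit', 'Attribute', 'Date'],
                             ['a', 'A', 1, 2014], ['b', 'B', 2, 2015],
                             ['c', 'C', 3, 2016], ['d', 'D', 4, 2017],
                             ['e', 'E', 5, 2018]]))
df2 = pd.DataFrame(np.array([['Name', 'Unit', 'Date'],
                             ['a', 'F', 2019], ['b', 'G', 2020],
                             ['e', 'H', 2021], ['f', 'I', 2022]]))

# to_numpy() drops the labels of the right-hand side, so the 3x2 block is
# written by position instead of being aligned on row/column labels.
df1.loc[[1, 2, 5], [1, 3]] = df2.loc[[1, 2, 3], [1, 2]].to_numpy()
print(df1)
```

This replaces both loops with a single assignment, but it still relies on hand-maintained row lists; matching rows on the Name column, as done in the rest of this answer, avoids that.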

Some cleanup first:

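def clean_df(df):
    # Promote the first row (the header stored as data) to column names.
    df.columns = df.iloc[0]
    df.columns.name = None
    # Drop the header row; reset_index() keeps the old labels in an 'index' column.
    df = df.iloc[1:].reset_index()

    return df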

df1 = clean_df(df1)
df1
   index Name Unit Attribute  Date
0      1    a    A         1  2014
1      2    b    B         2  2015
2      3    c    C         3  2016
3      4    d    D         4  2017
4      5    e    E         5  2018

df2 = clean_df(df2)
df2    
   index Name Unit  Date
0      1    a    F  2019
1      2    b    G  2020
2      3    e    H  2021
3      4    f    I  2022


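Use merge with on='Name', so that rows are matched on Name only and the other shared columns (Unit, Date) are not used as keys. The merged frame carries those columns with an _x suffix (from df1) and a _y suffix (from df2); fillna(df1) then restores df1's values for names that have no match in df2.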

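# Keep Name and Attribute plus df2's versions (_y) of the shared columns.
cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
# Left merge keeps every df1 row; strip the suffixes, then fill unmatched cells from the original df1.
df1 = df1.merge(df2, how='left', on='Name')[cols]\
              .rename(columns=lambda x: x.split('_')[0]).fillna(df1)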

df1
  Name Unit Attribute  Date
0    a    F         1  2019
1    b    G         2  2020
2    c    C         3  2016
3    d    D         4  2017
4    e    H         5  2021
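
As a possible alternative to the merge above, DataFrame.update can overwrite matching cells in place once both frames are indexed by the key column. This is a sketch under assumptions rather than the answer's method: the frames are rebuilt here with proper headers (the extra index column left by clean_df is omitted), and the name updated is introduced only for illustration.

```python
import pandas as pd

# Cleaned frames with real column names (hypothetical rebuild of the data above).
df1 = pd.DataFrame({'Name': list('abcde'),
                    'Unit': list('ABCDE'),
                    'Attribute': [1, 2, 3, 4, 5],
                    'Date': [2014, 2015, 2016, 2017, 2018]})
df2 = pd.DataFrame({'Name': list('abef'),
                    'Unit': list('FGHI'),
                    'Date': [2019, 2020, 2021, 2022]})

# Index both frames by Name so that update() aligns rows on the key.
updated = df1.set_index('Name')
# update() works in place: it overwrites only the cells whose (Name, column)
# label also exists in df2. Name 'f' is ignored because df1 has no such row.
updated.update(df2.set_index('Name'))
updated = updated.reset_index()
print(updated)
```

Rows c and d keep their original Unit and Date because df2 has no entry for them, and Attribute is untouched because df2 has no such column.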
