How to fetch the modified rows after comparing 2 versions of the same data frame


Question

I am creating a script that reads today's CSV file and compares it with yesterday's file of the same data.

This CSV gets uploaded to the server once daily, and I want to compare today's and yesterday's files.

I want to know which rows were modified, inserted, or deleted by comparing these 2 files.

I have done it for inserts and deletes, but I am struggling with modified rows.

Below is the code for getting the INSERT and DELETE DataFrames:

import pandas as pd

def getInsDel(df_old, df_new, key):
    # Concatenate old and new data to generate comparisons
    df = pd.concat([df_new, df_old])
    df = df.reset_index(drop=True)


    #doing a group by for getting the frequency of each key
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')

    # Keep keys with frequency = 1, since these are the target records for DELETE and INSERT.
    # Boolean indexing avoids shadowing the built-in filter() and, unlike where() + dropna(),
    # does not cast the key column to float.
    print('Creating data frame to get records with Frequency = 1  ...')
    df_delta_freq_ins_del = df_delta_freq[df_delta_freq['Freq'] == 1]


    print('Creating data frames of Insert and Deletes  ...')
    #Creating INSERT dataFrame 
    df_ins = pd.merge(df_new, 
                     df_delta_freq_ins_del[key],
                     on = key,
                     how = 'inner'
                    )

    #Creating DELETE dataFrame
    df_del = pd.merge(df_old, 
                     df_delta_freq_ins_del[key],
                     on = key,
                     how = 'inner'
                    )

    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))


    return df_ins,df_del

For example, the old data is:

ID  Name  X  Y
1   ABC   1  2
2   DEF   2  3
3   HIJ   3  4

The new data set is:

ID  Name   X   Y
2   DEF    2   3
3   HIJ    55  42
4   KLM    4   5

where ID is the key.

Modified_DataFrame should be:

ID   Name   X   Y
3    HIJ   55   42

NOTE: ID 1 would be in the DELETE DataFrame and ID 4 would be in the INSERT DataFrame (ID 2 is unchanged, so it appears in neither). I have done this part by grouping on the key and then filtering on the frequency of each key. If the frequency is 1, then I know it is either a DELETE or an INSERT.
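The frequency idea in the note can be tried on the sample data. This is a minimal sketch, not the code from the function above: it assumes `ID` is the only key column and swaps the group-by on all columns for `drop_duplicates(keep=False)`, which also discards rows that are identical in both files. The names `df_delta`, `key_freq` and `single_keys` are made up for illustration.

```python
import pandas as pd

# Sample frames copied from the example above
df_old = pd.DataFrame({"ID": [1, 2, 3], "Name": ["ABC", "DEF", "HIJ"],
                       "X": [1, 2, 3], "Y": [2, 3, 4]})
df_new = pd.DataFrame({"ID": [2, 3, 4], "Name": ["DEF", "HIJ", "KLM"],
                       "X": [2, 55, 4], "Y": [3, 42, 5]})

# Stack both versions and drop every row that occurs identically in both files
df_all = pd.concat([df_new, df_old], ignore_index=True)
df_delta = df_all.drop_duplicates(keep=False)

# Count how many differing rows remain for each key
key_freq = df_delta.groupby("ID").size()

# A key that remains exactly once exists in only one of the two files
single_keys = key_freq[key_freq == 1].index
df_ins = df_new[df_new["ID"].isin(single_keys)]  # only in the new file
df_del = df_old[df_old["ID"].isin(single_keys)]  # only in the old file

print(key_freq)
print(df_ins)
print(df_del)
```

On this data the unchanged row for ID 2 is dropped before counting, so only IDs 1 and 4 end up with a frequency of 1.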

However, what can be done to get the Modified_DataFrame?

Thanks!

Recommended answer

After taking reference from the link in the comments and making some modifications, I added the MODIFY DataFrame as below:

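    # Stack new and old rows, then collapse rows that are identical in both files
    df_all = pd.concat([df_new, df_old], ignore_index=True)
    modifications = df_all.drop_duplicates(keep='last')
    # A key that still occurs more than once has different values in the two files
    # (works whether key is a single column name or a list of names)
    mod_keys = modifications.loc[modifications.duplicated(subset=key), key]

    # Fetch the new version of each modified row
    df_mod = pd.merge(df_new,
                     mod_keys,
                     on=key,
                     how='inner'
                    )

    print('size of MODIFY file: ' + str(df_mod.shape))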

Thanks!
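As a cross-check on the approach above, the three frames can also be derived with the `indicator` option of `DataFrame.merge`. This is an alternative sketch rather than the code from this answer. It assumes `ID` is the single key column, that both files share the same columns, and that neither file contains repeated identical rows. The names `cmp_new` and `not_in_old` are hypothetical.

```python
import pandas as pd

# Sample frames copied from the question
df_old = pd.DataFrame({"ID": [1, 2, 3], "Name": ["ABC", "DEF", "HIJ"],
                       "X": [1, 2, 3], "Y": [2, 3, 4]})
df_new = pd.DataFrame({"ID": [2, 3, 4], "Name": ["DEF", "HIJ", "KLM"],
                       "X": [2, 55, 4], "Y": [3, 42, 5]})
key = "ID"  # assumed key column

# Left-join the new rows against the old rows on ALL shared columns;
# the _merge column marks new rows that have no identical row in the old file
cmp_new = df_new.merge(df_old, how="left", indicator=True)
not_in_old = cmp_new.loc[cmp_new["_merge"] == "left_only"].drop(columns="_merge")

# Same key in the old file but different values: MODIFY
df_mod = not_in_old[not_in_old[key].isin(df_old[key])]
# Key missing from the old file: INSERT
df_ins = not_in_old[~not_in_old[key].isin(df_old[key])]
# Key present in the old file but missing from the new file: DELETE
df_del = df_old[~df_old[key].isin(df_new[key])]

print(df_mod)
print(df_ins)
print(df_del)
```

With the sample data, `df_mod` holds the new version of ID 3, `df_ins` holds ID 4 and `df_del` holds ID 1. For a composite key, the `isin` checks would need to be replaced by a merge on the key columns.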
