How to fetch the modified rows after comparing two versions of the same DataFrame
Problem description
I am writing a script that reads today's CSV file and compares it with yesterday's file of the same data.
This CSV is uploaded to the server once a day, and I want to compare today's and yesterday's files.
By comparing the two files, I want to find the rows that were modified, inserted, or deleted.
I have this working for inserts and deletes, but I am struggling with modified rows.
Below is the code for getting the INSERT and DELETE DataFrames:
import pandas as pd

def getInsDel(df_old, df_new, key):
    # Concatenate old and new data to generate comparisons
    df = pd.concat([df_new, df_old])
    df = df.reset_index(drop=True)
    # Group by all columns: rows that occur exactly once differ between the two files
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')
    # Keep keys with frequency 1, since these are the target records for DELETE and INSERT
    print('Creating data frame to get records with Frequency = 1 ...')
    # Boolean indexing replaces where() + dropna(), which would cast the key columns to float
    df_delta_freq_ins_del = df_delta_freq[df_delta_freq['Freq'] == 1]
    print('Creating data frames of Insert and Deletes ...')
    # INSERT DataFrame: frequency-1 keys that are present in the new data
    df_ins = pd.merge(df_new,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner')
    # DELETE DataFrame: frequency-1 keys that are present in the old data
    df_del = pd.merge(df_old,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner')
    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))
    return df_ins, df_del
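As a point of comparison, the same inserts and deletes can probably be found more directly with an outer merge on the key columns and pandas' `indicator=True` option. This is only a sketch, not the code from the question: the function name `split_inserts_deletes` is made up for illustration, and it assumes `key` is a list of column names such as `['ID']`.

```python
import pandas as pd

def split_inserts_deletes(df_old, df_new, key):
    """Hypothetical helper: return (df_ins, df_del) by comparing key columns only.

    Assumes `key` is a list of column names present in both frames.
    """
    # The outer merge of the distinct keys labels each key as
    # 'left_only' (old file only), 'right_only' (new file only) or 'both'.
    keys = pd.merge(
        df_old[key].drop_duplicates(),
        df_new[key].drop_duplicates(),
        on=key,
        how='outer',
        indicator=True,
    )
    deleted_keys = keys.loc[keys['_merge'] == 'left_only', key]
    inserted_keys = keys.loc[keys['_merge'] == 'right_only', key]
    # Pull the full rows for those keys from the file they belong to.
    df_del = df_old.merge(deleted_keys, on=key, how='inner')
    df_ins = df_new.merge(inserted_keys, on=key, how='inner')
    return df_ins, df_del
```

Because only the key columns are compared, a row whose values changed but whose key exists in both files is reported as neither an insert nor a delete.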
For example, the old data is:
ID  Name  X   Y
1   ABC   1   2
2   DEF   2   3
3   HIJ   3   4
The new dataset is:
ID  Name  X   Y
2   DEF   2   3
3   HIJ   55  42
4   KLM   4   5
where ID is the key.
Modified_DataFrame should be:
ID  Name  X   Y
3   HIJ   55  42
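To reproduce the example, the tables above can be built directly as DataFrames. This is a minimal sketch: `df_old`, `df_new` and `key` follow the names used in the question's function, while `expected_mod` is a name introduced here for illustration.

```python
import pandas as pd

# Yesterday's file
df_old = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['ABC', 'DEF', 'HIJ'],
    'X': [1, 2, 3],
    'Y': [2, 3, 4],
})

# Today's file
df_new = pd.DataFrame({
    'ID': [2, 3, 4],
    'Name': ['DEF', 'HIJ', 'KLM'],
    'X': [2, 55, 4],
    'Y': [3, 42, 5],
})

# Key column(s), given as a list so that multi-column keys work the same way
key = ['ID']

# The desired result: the row whose key exists in both files but whose values differ
expected_mod = pd.DataFrame({'ID': [3], 'Name': ['HIJ'], 'X': [55], 'Y': [42]})
```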
NOTE: Here ID 1 would be in the DELETE DataFrame and ID 4 would be in the INSERT DataFrame (ID 2 is unchanged, so it appears in neither). I have done this part by grouping on the key and then filtering on the frequency of each key: if the frequency is 1, I know the row is either a DELETE or an INSERT.
However, what can be done to get the Modified_DataFrame?
Thanks!
Recommended answer
After following the link in the comments and making some modifications, I added the MODIFY DataFrame as below:
# key: the key column(s), e.g. ['ID']
df_all = pd.concat([df_new, df_old], ignore_index=True)
# Collapse rows that are identical in both files into a single copy
modifications = df_all.drop_duplicates(keep='last')
# A key that still occurs more than once has different values in old and new
mod_keys = modifications.loc[modifications[key].duplicated(), key]
# Take the new version of each modified row
df_mod = pd.merge(df_new,
                  mod_keys,
                  on=key,
                  how='inner')
print('size of MODIFY file: ' + str(df_mod.shape))
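The same idea can also be sketched with `merge(..., indicator=True)`: first find the rows of the new file that have no identical row in the old file, then keep only those whose key already existed. This is an illustrative alternative rather than the approach from the linked comment. The function name `modified_rows` is hypothetical, and the sketch assumes `key` is a list of column names and that both frames share the same columns.

```python
import pandas as pd

def modified_rows(df_old, df_new, key):
    """Hypothetical helper: rows of df_new whose key exists in df_old
    but whose other values differ.

    Assumes `key` is a list of column names and both frames have the same columns.
    """
    # With no `on` argument, merge joins on all shared columns, so a row is
    # marked 'both' only when an identical row exists in df_old.
    diff = df_new.merge(df_old, how='left', indicator=True)
    changed_or_new = diff.loc[diff['_merge'] == 'left_only'].drop(columns='_merge')
    # Discard pure inserts: keep only rows whose key was already present in df_old.
    return changed_or_new.merge(df_old[key].drop_duplicates(), on=key, how='inner')
```

Called as `modified_rows(df_old, df_new, ['ID'])` on the sample data from the question, this should return the single row for ID 3 with its new values.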