比较2个Excel文件并输出具有差异的Excel文件 [英] Compare 2 Excel files and output an Excel file with differences

查看:107
本文介绍了比较2个Excel文件并输出具有差异的Excel文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

为简单起见,假设数据文件看起来像这样,按ID排序:

Assume for simplicity that the data files look like this, sorted on ID:

ID  | Data1  | Data2  | Data3   | Data4
199 |  Tim   |   55   |  work  |  $55
345 |  Joe   |   45   |  work  |  $34
356 |  Sam   |   23   |  uni   |  $12

每个文件具有超过100,000行和约50列。
我想将第二个文件与第一个文件进行比较,以获取新记录(新ID),进行编辑(ID匹配,但第2列或第4列已更改(Data1和Data3)和删除(第一个文件中的ID不存在)第二个文件)。

Each file has more than 100,000 rows and about 50 columns. I want to compare a 2nd file with the first for new records (new ID ), edits (IDs match but columns 2 or 4 have changed (Data1 and Data3), and Deletes (ID in first file does not exist in the 2nd file).

输出将显示在Excel文件中,第一列包含D,E或N(用于Delete,Edit和New),其余列

Output is to appear in an Excel file with the first column containing D, E or N (for Delete, Edit and New), and the rest of the columns being the same as the columns in the files being compared.

对于新记录,完整的新记录将出现在输出文件中。编辑两个记录都将显示在输出文件中,但仅显示那些已更改的字段
对于已删除的记录,完整的旧记录将显示在输出文件中。

For new records the full new record is to appear in the output file. For Edits both the records are to appear in the output file, but only those fields that have changed are to appear. For deleted records the full old record is to appear in the output file.

在处理文件时,我还希望在屏幕上显示以下输出:

I would also like the following output to the screen as the files are being processed:

Deletes: D: 199, Tim
Edits:   E: 345, Joe -> John
         E: 345, work -> xxx
New:     N: 999, Ami

谢谢。

推荐答案

我建议您阅读一些优秀 简介来了解熊猫的工作原理和原理,并使其适应您的特定需求

I suggest you read some of the excellent introductions to pandas to understand how and why this works and to adapt it to your specific needs

pandas.read_excel

import pandas as pd

filename1 = 'filename1.xlsx'
filename2 = 'filename2.xlsx'

df1 = pd.read_excel(filename1, index_col=0)
df2 = pd.read_excel(filename2, index_col=0)

df1和df2应该为 pandas.DataFrame s,ID为 ,索引为,第行作为或标题

df1 and df2 should be pandas.DataFrames with ID as index and the first row as columns or headers

pandas .merge

df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator=True)



选择更改



Selecting the changes

id_new = df_merged.index[df_merged['_merge'] == 'right_only'] 
id_deleted = df_merged.index[df_merged['_merge'] == 'left_only'] 
id_changed_data1 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data1_x'] != df_merged['Data1_y'])]
id_changed_data3 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data3_x'] != df_merged['Data3_y'])]

(或 Index )的更改,您可以根据需要对其进行格式化

This gives you lists (or an Index rather) of the changes, which you can format as you want

这篇关于比较2个Excel文件并输出具有差异的Excel文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆