比较2个Excel文件并输出具有差异的Excel文件 [英] Compare 2 Excel files and output an Excel file with differences
问题描述
为简单起见,假设数据文件看起来像这样,按ID排序:
Assume for simplicity that the data files look like this, sorted on ID:
ID | Data1 | Data2 | Data3 | Data4
199 | Tim | 55 | work | $55
345 | Joe | 45 | work | $34
356 | Sam | 23 | uni | $12
每个文件具有超过100,000行和约50列。
我想将第二个文件与第一个文件进行比较,以获取新记录(新ID),进行编辑(ID匹配,但第2列或第4列已更改(Data1和Data3)和删除(第一个文件中的ID不存在)第二个文件)。
Each file has more than 100,000 rows and about 50 columns. I want to compare a 2nd file with the first for new records (new ID ), edits (IDs match but columns 2 or 4 have changed (Data1 and Data3), and Deletes (ID in first file does not exist in the 2nd file).
输出将显示在Excel文件中,第一列包含D,E或N(用于Delete,Edit和New),其余列
Output is to appear in an Excel file with the first column containing D, E or N (for Delete, Edit and New), and the rest of the columns being the same as the columns in the files being compared.
对于新记录,完整的新记录将出现在输出文件中。编辑两个记录都将显示在输出文件中,但仅显示那些已更改的字段
对于已删除的记录,完整的旧记录将显示在输出文件中。
For new records the full new record is to appear in the output file. For Edits both the records are to appear in the output file, but only those fields that have changed are to appear. For deleted records the full old record is to appear in the output file.
在处理文件时,我还希望在屏幕上显示以下输出:
I would also like the following output to the screen as the files are being processed:
Deletes: D: 199, Tim
Edits: E: 345, Joe -> John
E: 345, work -> xxx
New: N: 999, Ami
谢谢。
推荐答案
我建议您阅读一些优秀 简介来了解熊猫的工作原理和原理,并使其适应您的特定需求
I suggest you read some of the excellent introductions to pandas to understand how and why this works and to adapt it to your specific needs
import pandas as pd
filename1 = 'filename1.xlsx'
filename2 = 'filename2.xlsx'
df1 = pd.read_excel(filename1, index_col=0)
df2 = pd.read_excel(filename2, index_col=0)
df1和df2应该为 pandas.DataFrame
s,ID为 ,索引为
,第行作为
列
或标题
df1 and df2 should be pandas.DataFrame
s with ID
as index
and the first row as columns
or headers
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator=True)
选择更改
Selecting the changes
id_new = df_merged.index[df_merged['_merge'] == 'right_only']
id_deleted = df_merged.index[df_merged['_merge'] == 'left_only']
id_changed_data1 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data1_x'] != df_merged['Data1_y'])]
id_changed_data3 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data3_x'] != df_merged['Data3_y'])]
(或 Index
)的更改,您可以根据需要对其进行格式化
This gives you lists (or an Index
rather) of the changes, which you can format as you want
这篇关于比较2个Excel文件并输出具有差异的Excel文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!