Compare two Excel files that have a different number of rows using Python Pandas


Problem description

I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I looked on the website, but I didn't find a solution for my case!

Here is an example:

df1 (old report):

id   qte   d1   d2
A    10    23   35
B    43    63   63
C    15    61   62

df2 (new report):

id   qte   d1   d2
A    20    23   35
C    15    61   62
E    38    62   16
F    63    20   51

The result should be:

  • Modified rows must be in yellow, with the modified value in red.

  • New rows must be in green.

  • Deleted rows must be in red.

id   qte   d1   d2
A    20    23   35
C    15    61   62
B    43    63   63
E    38    62   16
F    63    20   51

Code:

import pandas as pd
import numpy as np

df1 = pd.read_excel(r'C .....\data novembre.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel(r'C.....\data decembre.xlsx', 'Sheet1', na_values=['NA'])
merged_data = df1.merge(df2, on='id', how='outer')

Joining the data, though, is not what I want!

I'm just starting to learn Python, so I really need help!

Recommended answer

An Excel diff can quickly become a funky beast, but we should be able to do this with some concats and boolean statements.

Assuming your dataframes are called df1 and df2:

# Use the id column as the index so the two reports line up by id.
df1 = df1.set_index('id')
df2 = df2.set_index('id')

# Stack both reports, then collect the distinct values seen for each (id, column) cell.
df3 = pd.concat([df1, df2], sort=False)
df3a = df3.stack().groupby(level=[0, 1]).unique().unstack(1).copy()

df3a.loc[~df3a.index.isin(df2.index), 'status'] = 'deleted'  # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index), 'status'] = 'new'      # if not in df1 index then new
idx = df3.stack().groupby(level=[0, 1]).nunique()  # number of distinct values per cell
# Any id with a cell holding more than one distinct value has been modified.
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0), 'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same')  # anything not covered by the rules above is unchanged


print(df3a)

      d1    d2       qte    status
id                                
A   [23]  [35]  [10, 20]  modified
B   [63]  [63]      [43]   deleted
C   [61]  [62]      [15]      same
E   [62]  [16]      [38]       new
F   [20]  [51]      [63]       new
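
The status table above does not yet produce the colors the question asks for. As a rough sketch of one possible way to get there (this is not part of the original answer), the code below builds a table of CSS strings with the same shape as the report, which pandas' Styler can then apply when writing to Excel. The sample frames are copied from the question; `build_report`, `export_report`, the color constants and any output path are hypothetical names. It assumes the ids are unique and both files have identical columns.

```python
import pandas as pd

# Sample data copied from the question, indexed by id.
df1 = pd.DataFrame(
    {"id": ["A", "B", "C"], "qte": [10, 43, 15], "d1": [23, 63, 61], "d2": [35, 63, 62]}
).set_index("id")
df2 = pd.DataFrame(
    {"id": ["A", "C", "E", "F"], "qte": [20, 15, 38, 63],
     "d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51]}
).set_index("id")

# Hypothetical style constants (CSS understood by pandas' Excel exporter).
YELLOW = "background-color: yellow"
GREEN = "background-color: green"
RED = "background-color: red"
CHANGED_CELL = "background-color: yellow; color: red"


def build_report(old, new):
    """Return (report, styles): the combined rows and a same-shaped frame of CSS strings."""
    deleted = old.index.difference(new.index)
    added = new.index.difference(old.index)
    common = old.index.intersection(new.index)

    # Rows of the new report, followed by the rows that only exist in the old one.
    report = pd.concat([new, old.loc[deleted]])
    styles = pd.DataFrame("", index=report.index, columns=report.columns)

    # Cell-by-cell comparison of rows present in both reports.
    # Note: NaN != NaN, so missing values on both sides would count as changed.
    changed = old.loc[common].ne(new.loc[common])
    modified = changed[changed.any(axis=1)].index

    styles.loc[added, :] = GREEN
    styles.loc[deleted, :] = RED
    styles.loc[modified, :] = YELLOW
    for row_id in modified:
        for col in changed.columns[changed.loc[row_id].to_numpy()]:
            styles.loc[row_id, col] = CHANGED_CELL
    return report, styles


def export_report(report, styles, path):
    """Write the colored report; assumes jinja2 and openpyxl are installed."""
    # With axis=None the function receives the whole frame and returns per-cell CSS.
    report.style.apply(lambda _: styles, axis=None).to_excel(path, engine="openpyxl")


report, styles = build_report(df1, df2)
print(styles)
```

Calling `export_report(report, styles, "diff.xlsx")` (the file name is only an example) should then write the highlighted sheet. Keeping the styles in an ordinary DataFrame makes them easy to inspect before anything is written to disk.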

If you don't mind the performance hit of converting all your data types to strings, then this could work. I don't recommend it, though; use a fact table or a slowly changing dimension schema to hold such data. You'll thank yourself in the future.

df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)

    d1  d2      qte    status
id                           
A   23  35  10-->20  modified
B   63  63       43   deleted
C   61  62       15      same
E   62  16       38       new
F   20  51       63       new
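
One way to read the fact-table suggestion is to keep one row per changed cell instead of packing old and new values into a string. The sketch below is my own interpretation, not the answerer's code: it reshapes both reports to long form with `melt` and joins them on id and column name. The sample frames come from the question, the names `column`, `old`, `new` and `changes` are made up for illustration, and it assumes unique ids.

```python
import pandas as pd

# Sample data copied from the question, with id as a regular column
# (as it would be straight after pd.read_excel).
df1 = pd.DataFrame(
    {"id": ["A", "B", "C"], "qte": [10, 43, 15], "d1": [23, 63, 61], "d2": [35, 63, 62]}
)
df2 = pd.DataFrame(
    {"id": ["A", "C", "E", "F"], "qte": [20, 15, 38, 63],
     "d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51]}
)

# Long form: one row per (id, column) cell for each report.
old = df1.melt(id_vars="id", var_name="column", value_name="old")
new = df2.melt(id_vars="id", var_name="column", value_name="new")

# The outer join keeps cells that exist on only one side (new or deleted rows);
# those show up with NaN in the missing side.
cells = old.merge(new, on=["id", "column"], how="outer")

# Keep only the cells whose value differs between the two reports.
changes = cells[cells["old"].ne(cells["new"])].reset_index(drop=True)
print(changes)
```

The values stay numeric here (as floats, because of the NaN markers), so the change log can still be filtered or aggregated without parsing strings.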
