Compare two DataFrames and output their differences side-by-side


Problem description

I am trying to highlight exactly what changed between two DataFrames.

Suppose I have two pandas DataFrames:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

My goal is to output an HTML table that:

  1. Identifies rows that have changed (values could be int, float, boolean, or string)
  2. Outputs the unchanged, OLD, and NEW values for those rows (ideally into an HTML table) so the consumer can clearly see what changed between the two DataFrames:

"StudentRoster Difference Jan-1 - Jan-2":
id   Name   score                    isEnrolled           Comment
112  Nick   was 1.11 | now 1.21      False                Graduated
113  Zoe    4.12                     was True | now False was "" | now "On vacation"


I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?

Recommended answer

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool
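For readers who want to reproduce the transcript, here is a minimal sketch that builds the two rosters from the question and computes the same row mask. The names `columns` and `changed_rows` are illustrative and not part of the original answer, and Zoe's missing Jan-1 comment is assumed to be an empty string rather than a missing value.

```python
import pandas as pd

columns = ["id", "Name", "score", "isEnrolled", "Comment"]

# "StudentRoster Jan-1" (empty string assumed for Zoe's missing comment)
df1 = pd.DataFrame([
    [111, "Jack", 2.17, True, "He was late to class"],
    [112, "Nick", 1.11, False, "Graduated"],
    [113, "Zoe", 4.12, True, ""],
], columns=columns)

# "StudentRoster Jan-2"
df2 = pd.DataFrame([
    [111, "Jack", 2.17, True, "He was late to class"],
    [112, "Nick", 1.21, False, "Graduated"],
    [113, "Zoe", 4.12, False, "On vacation"],
], columns=columns)

# True for every row in which at least one cell differs
ne = (df1 != df2).any(axis=1)

# The changed rows, shown with their new values
changed_rows = df2[ne]
print(changed_rows)
```

Both frames must have identical row and column labels for `df1 != df2` to work; otherwise pandas raises an error instead of comparing.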

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

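Here the first level is the row index and the second is the column that changed. With NumPy imported as np, the old and new values at those positions can then be pulled out: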

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
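The from/to frame above can be taken one step further towards the "was X | now Y" layout the question asks for. The following is only a sketch of one way to do that, not the answer author's method: `show` and `diff_side_by_side` are hypothetical helper names, the frames are assumed to be indexed by `id`, and Zoe's missing Jan-1 comment is assumed to be an empty string.

```python
import pandas as pd

columns = ["id", "Name", "score", "isEnrolled", "Comment"]
df1 = pd.DataFrame([
    [111, "Jack", 2.17, True, "He was late to class"],
    [112, "Nick", 1.11, False, "Graduated"],
    [113, "Zoe", 4.12, True, ""],
], columns=columns).set_index("id")
df2 = pd.DataFrame([
    [111, "Jack", 2.17, True, "He was late to class"],
    [112, "Nick", 1.21, False, "Graduated"],
    [113, "Zoe", 4.12, False, "On vacation"],
], columns=columns).set_index("id")


def show(value):
    """Quote strings so an empty old value stays visible; plain text otherwise."""
    return repr(value) if isinstance(value, str) else str(value)


def diff_side_by_side(old, new):
    """Return only the changed rows, with changed cells as 'was X | now Y'."""
    mask = old != new                # True wherever a cell differs
    merged = new.astype(object)      # object dtype so any cell can hold text
    for col in new.columns:
        # Row labels whose value in this column changed
        for row in mask.index[mask[col].to_numpy()]:
            merged.at[row, col] = (
                f"was {show(old.at[row, col])} | now {show(new.at[row, col])}"
            )
    return merged.loc[mask.any(axis=1)]


result = diff_side_by_side(df1, df2)
html = result.to_html()              # HTML table, as the question requests
print(result)
```

The explicit loop keeps the logic easy to follow; for large frames a vectorised approach would likely be faster, but the number of changed cells is usually small.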

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can restrict the comparison to the shared labels using df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index for this), but I'll leave that as an exercise.
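One possible take on that exercise, as a rough sketch: restrict both frames to the labels they share before comparing, and report labels that exist on only one side separately. The frames `old` and `new` and their values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical rosters whose indexes only partly overlap
old = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
new = pd.DataFrame({"score": [1.21, 4.12, 3.30]}, index=[112, 113, 114])

# Labels present in both frames
shared = old.index.intersection(new.index)

# Compare only the shared rows, so both sides are identically labelled
old_shared = old.loc[shared]
new_shared = new.loc[shared]
ne = (old_shared != new_shared).any(axis=1)

# Rows that exist on one side only are not "changed"; list them separately
removed = old.index.difference(new.index)
added = new.index.difference(old.index)

print(ne)
print("removed:", list(removed), "added:", list(added))
```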
