Compare two DataFrames and output their differences side-by-side

Question

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation
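
For anyone who wants to reproduce the problem, the two rosters can be built as follows. This is only a sketch: it assumes a default integer index, keeps id as an ordinary column, and represents Zoe's empty Jan-1 comment as None.

```python
import pandas as pd

# Roster as of Jan-1; Zoe has no comment yet (assumed to be None).
df1 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
})

# Roster as of Jan-2; Nick's score and Zoe's enrollment/comment differ.
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})
```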

My goal is to output an HTML table that:

  1. Identifies rows that have changed (could be int, float, boolean, string)
  2. Outputs rows with same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between two dataframes:

"StudentRoster Difference Jan-1 - Jan-2":  
id   Name   score                    isEnrolled            Comment
112  Nick   was 1.11 | now 1.21      False                 Graduated
113  Zoe    4.12                     was True | now False  was "" | now "On vacation"

I suppose I could do a row by row and column by column comparison, but is there an easier way?

Recommended answer

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool
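
As a small illustration of what that mask is good for, it can be used to pull the changed rows out of either frame. The sketch below rebuilds the question's data (with None assumed for the empty comment); the names old_rows and new_rows are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
})
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})

# True for every row in which at least one cell differs.
ne = (df1 != df2).any(axis=1)

# Boolean indexing keeps only the rows flagged as changed.
old_rows = df1[ne]   # values before the change
new_rows = df2[ne]   # values after the change
```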

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

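Here the first entry is the index and the second is the column that changed. With NumPy imported as np, the old and new values at those positions can then be extracted: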

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
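
To get closer to the "was X | now Y" table the question asks for, one possible approach is sketched below. The helper diff_side_by_side is hypothetical (it is not part of pandas), it assumes both frames share the same index and columns, and it treats a cell that is missing in both frames as unchanged, since missing values otherwise compare as unequal even to each other.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
})
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})


def diff_side_by_side(old, new):
    """Hypothetical helper: return only the changed rows, with each
    changed cell rendered as 'was <old> | now <new>'.

    Assumes old and new have identical index and column labels.
    """
    # A cell counts as changed if the values differ and are not both missing.
    changed = (old != new) & ~(old.isna() & new.isna())
    changed_idx = old.index[changed.any(axis=1)]

    records = []
    for idx in changed_idx:
        row = {}
        for col in old.columns:
            if changed.at[idx, col]:
                row[col] = f"was {old.at[idx, col]} | now {new.at[idx, col]}"
            else:
                row[col] = new.at[idx, col]
        records.append(row)

    return pd.DataFrame(records, index=changed_idx, columns=old.columns)


diff = diff_side_by_side(df1, df2)
html = diff.to_html()  # HTML table markup as a string
```

The explicit loop is slow on large frames but easy to follow. Newer pandas versions (1.1 and later) also provide DataFrame.compare, which returns the differing cells of two identically labelled frames and may be enough if the exact "was | now" wording is not required.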

* Note: it's important that df1 and df2 share the same index here. If they don't, you can make sure you only look at the shared labels using df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index for this), but I think I'll leave that as an exercise.
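
One way that exercise might go is sketched below. The frames here are made-up examples indexed by student id (114 appears only in the second one); comparing them directly with != would raise an error because their labels differ, so both are first restricted to the shared labels.

```python
import pandas as pd

# Hypothetical rosters indexed by student id; the indexes only partly overlap.
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.21, 4.12, 3.00]}, index=[112, 113, 114])

# Labels present in both frames.
shared = df1.index.intersection(df2.index)

# Restrict both frames to the shared labels so they are identically labelled.
a = df1.loc[shared]
b = df2.loc[shared]

# Row-level change mask, as in the answer above.
ne = (a != b).any(axis=1)
```

Rows that exist in only one frame (111 and 114 here) are ignored by this comparison and would need to be reported separately as removed or added.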
