Outputting difference in two pandas dataframes side by side - highlighting the difference


Problem description




I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

My goal is to output an html table that:

  1. Identifies rows that have changed (could be int, float, boolean, string)
  2. Outputs the changed rows with their unchanged, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between the two dataframes.

    "StudentRoster Difference Jan-1 - Jan-2":
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now "On vacation"

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?

Solution

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first level of the index is the row label and the second is the column that has changed.
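
Note that in the session above the level named 'id' actually holds the row positions (1 and 2), not the student ids. A possible variation, sketched below, is to move the id column into the index before comparing so the stacked result carries 112 and 113 directly. This assumes id is unique and present in both frames; old and new are just illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
})
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})

# Use the student id as the row label (assumes ids are unique in both frames).
old = df1.set_index("id")
new = df2.set_index("id")

# One boolean per (id, column) cell, then keep only the cells that differ.
ne_stacked = (old != new).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ["id", "col"]
print(changed)
```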

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
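
Putting the steps together, here is a rough sketch that also covers the HTML part of the question via DataFrame.to_html. The helper name diff_table is hypothetical, the frames are indexed by id as an assumption, and the extra mask for cells that are missing in both frames is my own addition (NaN != NaN evaluates to True, which would otherwise be reported as a change).

```python
import numpy as np
import pandas as pd

idx = pd.Index([111, 112, 113], name="id")
df1 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
}, index=idx)
df2 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
}, index=idx)


def diff_table(old, new):
    """Return one row per changed cell with its old and new value."""
    # Ignore cells that are missing in both frames.
    differs = (old != new) & ~(old.isna() & new.isna())
    rows, cols = np.where(differs.to_numpy())
    index = pd.MultiIndex.from_arrays(
        [old.index[rows], old.columns[cols]], names=["id", "col"]
    )
    return pd.DataFrame(
        {"from": old.to_numpy()[rows, cols], "to": new.to_numpy()[rows, cols]},
        index=index,
    )


table = diff_table(df1, df2)
html = table.to_html()  # plain HTML table, ready to embed in a report
print(table)
```

This keeps one row per changed cell rather than one row per student, which is a different layout from the "was ... | now ..." table in the question, but it contains the same information.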

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels, for example with df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index for this), but I think I'll leave that as an exercise.
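
One way that exercise might go, as a sketch only: restrict both frames to the labels they have in common before comparing, and report added or removed labels separately. The data here is made up for illustration (student 114 and the score 3.50 are not from the question), and the frames are assumed to be indexed by id. Without this alignment step, comparing differently labeled frames with != raises a ValueError.

```python
import pandas as pd

# Illustrative data: 111 was removed and 114 was added between the two rosters.
old = pd.DataFrame({"score": [2.17, 1.11, 4.12]},
                   index=pd.Index([111, 112, 113], name="id"))
new = pd.DataFrame({"score": [1.21, 4.12, 3.50]},
                   index=pd.Index([112, 113, 114], name="id"))

# Labels present in both frames; only these rows can be compared cell by cell.
shared = old.index.intersection(new.index)
changed_rows = (old.loc[shared] != new.loc[shared]).any(axis=1)

# Labels that exist on one side only.
added = new.index.difference(old.index)
removed = old.index.difference(new.index)

print(changed_rows)
print("added:", list(added), "removed:", list(removed))
```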
