Compare two DataFrames and output their differences side-by-side
Problem description
I am trying to highlight exactly what changed between two dataframes.
Suppose I have two Python Pandas dataframes:
"StudentRoster Jan-1":
id   Name  score  isEnrolled  Comment
111  Jack  2.17   True        He was late to class
112  Nick  1.11   False       Graduated
113  Zoe   4.12   True
"StudentRoster Jan-2":
id   Name  score  isEnrolled  Comment
111  Jack  2.17   True        He was late to class
112  Nick  1.21   False       Graduated
113  Zoe   4.12   False       On vacation
My goal is to output an HTML table that:
- Identifies rows that have changed (could be int, float, boolean, string)
- Outputs rows with same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between two dataframes:
"StudentRoster Difference Jan-1 - Jan-2":
id   Name  score                isEnrolled            Comment
112  Nick  was 1.11 | now 1.21  False                 Graduated
113  Zoe   4.12                 was True | now False  was "" | now "On vacation"
I suppose I could do a row by row and column by column comparison, but is there an easier way?
Recommended answer
The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
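To reproduce this step outside the IPython session, here is a minimal sketch. How the frames are built is an assumption (the question only shows printed tables), and the empty Comment cell is stored as None.

```python
import pandas as pd

# Rosters rebuilt from the question; the construction itself is an assumption.
columns = ["id", "Name", "score", "isEnrolled", "Comment"]
df1 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.11, False, "Graduated"],
     [113, "Zoe", 4.12, True, None]],
    columns=columns,
)
df2 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.21, False, "Graduated"],
     [113, "Zoe", 4.12, False, "On vacation"]],
    columns=columns,
)

# Element-wise inequality, then collapse each row to a single flag.
ne = (df1 != df2).any(axis=1)

# Boolean indexing keeps only the rows where something changed.
changed_rows = df2[ne]
```

Indexing with `ne` is one way to get the changed rows themselves, which is the starting point for the report the question asks for.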
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool
Here the first item is the index and the second is the column that has changed.
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
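The steps above can be collected into a single function. The following is a sketch under stated assumptions rather than the answer's exact code: `diff_table` is a hypothetical helper name, `id` is assumed to be the index of both frames, and the index of the result is built directly from the `np.where` positions instead of from `stack()`. It also skips cells that are missing on both sides, since `NaN != NaN` would otherwise count as a change.

```python
import numpy as np
import pandas as pd


def diff_table(old, new):
    """Hypothetical helper: one row per changed cell, with old and new values.

    Assumes both frames carry identical row and column labels.
    """
    # NaN != NaN is True, so cells missing on both sides are excluded explicitly.
    mask = ((old != new) & ~(old.isna() & new.isna())).to_numpy()
    rows, cols = np.where(mask)
    index = pd.MultiIndex.from_arrays(
        [old.index[rows], old.columns[cols]], names=["id", "col"]
    )
    return pd.DataFrame(
        {"from": old.to_numpy()[rows, cols], "to": new.to_numpy()[rows, cols]},
        index=index,
    )


# Rosters from the question, with id as the index (an assumption).
ids = pd.Index([111, 112, 113], name="id")
jan1 = pd.DataFrame(
    {"Name": ["Jack", "Nick", "Zoe"],
     "score": [2.17, 1.11, 4.12],
     "isEnrolled": [True, False, True],
     "Comment": ["He was late to class", "Graduated", None]},
    index=ids,
)
jan2 = pd.DataFrame(
    {"Name": ["Jack", "Nick", "Zoe"],
     "score": [2.17, 1.21, 4.12],
     "isEnrolled": [True, False, False],
     "Comment": ["He was late to class", "Graduated", "On vacation"]},
    index=ids,
)

result = diff_table(jan1, jan2)

# The question asks for HTML; DataFrame.to_html renders the table as a string.
html = result.to_html()
```

On pandas 1.1 or newer, `jan1.compare(jan2)` may be a shorter route to a similar side-by-side view, with the old and new values labeled `self` and `other`.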
* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (written as df1.index & df2.index in older pandas versions), but I think I'll leave that as an exercise.
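One possible take on that exercise, as a hedged sketch: restrict both frames to the row labels they have in common before comparing. The toy frames and the names `shared`, `left` and `right` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Toy frames with partly overlapping row labels (values are illustrative only).
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.21, 4.12, 3.00]}, index=[112, 113, 114])

# Keep only the labels present in both frames, in the same order.
shared = df1.index.intersection(df2.index)
left, right = df1.loc[shared], df2.loc[shared]

# With identical labels on both sides, the element-wise comparison is well defined.
ne = (left != right).any(axis=1)
```

Rows that exist in only one frame (111 and 114 here) drop out of the comparison, so additions and removals would need to be reported separately.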