Compare two DataFrames and output their differences side-by-side
Question
I am trying to highlight exactly what changed between two DataFrames.
Suppose I have two Python pandas DataFrames:
"StudentRoster Jan-1":
id   Name  score  isEnrolled  Comment
111  Jack  2.17   True        He was late to class
112  Nick  1.11   False       Graduated
113  Zoe   4.12   True
"StudentRoster Jan-2":
id   Name  score  isEnrolled  Comment
111  Jack  2.17   True        He was late to class
112  Nick  1.21   False       Graduated
113  Zoe   4.12   False       On vacation
My goal is to output an HTML table that:
- Identifies rows that have changed (the changed value could be an int, float, boolean, or string)
- Outputs rows with the unchanged, OLD, and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between the two DataFrames:
"StudentRoster Difference Jan-1 - Jan-2":
id   Name  score                isEnrolled            Comment
112  Nick  was 1.11 | now 1.21  False                 Graduated
113  Zoe   4.12                 was True | now False  was "" | now "On vacation"
I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?
Recommended answer
The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
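Here the first entry is the index label and the second is the column that changed. The old and new values can then be pulled from both frames (with numpy imported as np and pandas as pd):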
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
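To get closer to the side-by-side table the question asks for, the same element-wise mask can drive a "was | now" rendering. The following is only a minimal sketch, not part of the original answer: the helper name `side_by_side` and its `key` parameter are hypothetical, the missing Jan-1 comment is assumed to be None, and both frames are assumed to have the same ids and columns.

```python
import pandas as pd

# Sample rosters from the question; the empty Jan-1 comment is assumed to be None.
columns = ["id", "Name", "score", "isEnrolled", "Comment"]
df1 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.11, False, "Graduated"],
     [113, "Zoe", 4.12, True, None]],
    columns=columns,
)
df2 = pd.DataFrame(
    [[111, "Jack", 2.17, True, "He was late to class"],
     [112, "Nick", 1.21, False, "Graduated"],
     [113, "Zoe", 4.12, False, "On vacation"]],
    columns=columns,
)


def side_by_side(old, new, key="id"):
    """Hypothetical helper: keep changed rows, marking changed cells as 'was X | now Y'."""
    old = old.set_index(key)
    new = new.set_index(key)
    # Element-wise mask of differing cells; a pair of missing values counts as equal.
    ne = (old != new) & ~(old.isna() & new.isna())
    # Start from the new values (as objects) and overwrite only the changed cells.
    out = new.astype(object)
    for col in out.columns:
        for idx in ne.index[ne[col].to_numpy()]:
            out.at[idx, col] = f"was {old.at[idx, col]} | now {new.at[idx, col]}"
    # Drop rows in which nothing changed.
    return out[ne.any(axis=1)]


report = side_by_side(df1, df2)
html = report.to_html()  # HTML table markup as a string
print(report)
```

The nested loop is fine for small rosters; for large frames a vectorised approach would be preferable.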
* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index (df1.index.intersection(df2.index) in recent pandas versions), but I think I'll leave that as an exercise.
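One possible way to do that exercise is sketched below. The two small frames are made up for illustration, and the columns are assumed to be identical in both; `Index.intersection` is used as the explicit set operation on the index labels.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.21, 4.12, 3.30]}, index=[112, 113, 114])

# Labels present in both frames.
shared = df1.index.intersection(df2.index)

# Restrict both frames to the shared labels so the element-wise comparison lines up.
left = df1.loc[shared]
right = df2.loc[shared]

# Boolean Series marking which of the shared rows differ in any column.
changed_rows = (left != right).any(axis=1)
print(changed_rows[changed_rows].index.tolist())
```

Rows that exist in only one frame (111 and 114 here) are ignored by this comparison; they would need to be reported separately as removed or added.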