Compare two Spark DataFrames
Question
Spark DataFrame 1:
+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |236 |431|169    |
|city 2|prod 1 |9/28/2017|358 |975|193    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
+------+-------+---------+----+---+-------+
Spark DataFrame 2:
+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |230 |430|160    |
|city 1|prod 4 |9/27/2017|350 |90 |190    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
|city 3|prod 4 |9/18/2017|230 |431|169    |
+------+-------+---------+----+---+-------+
Given Spark DataFrame 1 and Spark DataFrame 2 above, I need a Spark DataFrame for each of the following conditions:
- Deleted records
- New records
- Records with no changes
- Records with changes
The key columns for the comparison are 'city', 'product' and 'date'.
We need a solution that does not use Spark SQL.
Recommended answer
I am not sure how to find the deleted and modified records, but you can use the except function to get the difference:
df2.except(df1)
This returns the rows of DataFrame 2 that have no identical row in DataFrame 1, that is, the new records and the records with changes. Output:
+------+-------+---------+----+---+-------+
|  city|product|     date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431| 169|
|city 1| prod 4|9/27/2017| 350| 90| 190|
|city 1| prod 3|9/9/2017 | 230|430| 160|
+------+-------+---------+----+---+-------+
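The except call above covers one direction of the comparison. As a rough sketch, the other direction and the identical rows can be obtained with the same family of set operations. The object name, the local SparkSession settings and the column types (dates as strings, measures as integers) are assumptions made for illustration; the rows are copied from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; in spark-shell only the body of main is needed.
object ExceptSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("except-sketch").getOrCreate()
    import spark.implicits._

    val cols = Seq("city", "product", "date", "sale", "exp", "wastage")

    // Data from the question; the column types are assumed.
    val df1 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
      ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
    ).toDF(cols: _*)

    val df2 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
      ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
      ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
    ).toDF(cols: _*)

    // Rows of df2 with no identical row in df1:
    // new records plus the new version of modified records.
    val addedOrModified = df2.except(df1)

    // Rows of df1 with no identical row in df2:
    // deleted records plus the old version of modified records.
    val deletedOrModified = df1.except(df2)

    // Rows that are identical in both DataFrames: records with no changes.
    val unchanged = df1.intersect(df2)

    addedOrModified.show(false)
    deletedOrModified.show(false)
    unchanged.show(false)

    spark.stop()
  }
}
```

Both except and intersect compare entire rows and return distinct rows, so a modified record such as (city 1, prod 3, 9/9/2017) appears in both differences. Telling it apart from a genuinely new or deleted record requires a comparison restricted to the key columns.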
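You can also try join and filter to get the changed and unchanged data. A left join keeps every row of df1 and leaves the df2 columns null where the key no longer exists; a right join does the same from the side of df2: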
df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
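The join-and-filter idea can be taken one step further to split the data into the four groups the question asks for. This is a minimal sketch rather than a definitive implementation: the object name, the "_new" column suffix, the local SparkSession and the column types are assumptions, and it relies on the "left_anti" join type, which should be available in Spark 2.0 and later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object; in spark-shell only the body of main is needed.
object KeyedDiffSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("keyed-diff").getOrCreate()
    import spark.implicits._

    val keys = Seq("city", "product", "date")   // key columns from the question
    val values = Seq("sale", "exp", "wastage")  // remaining columns to compare

    // Data from the question; the column types are assumed.
    val df1 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
      ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
    ).toDF(keys ++ values: _*)

    val df2 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
      ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
      ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
    ).toDF(keys ++ values: _*)

    // Deleted records: keys present in df1 but absent from df2.
    val deleted = df1.join(df2, keys, "left_anti")

    // New records: keys present in df2 but absent from df1.
    val added = df2.join(df1, keys, "left_anti")

    // Suffix the value columns of df2 so both versions stay distinguishable after the join.
    val df2Renamed = values.foldLeft(df2)((df, c) => df.withColumnRenamed(c, s"${c}_new"))

    // Keys present on both sides, with old and new values side by side.
    val matched = df1.join(df2Renamed, keys, "inner")

    // True when at least one value column differs; <=> is null-safe equality.
    val differs = values.map(c => !(col(c) <=> col(s"${c}_new"))).reduce(_ || _)

    // Records with changes: same key, different values.
    val changed = matched.filter(differs)

    // Records with no changes: same key, same values; keep the original columns only.
    val unchanged = matched.filter(!differs).select((keys ++ values).map(col): _*)

    deleted.show(false)
    added.show(false)
    changed.show(false)
    unchanged.show(false)

    spark.stop()
  }
}
```

Renaming the df2 value columns before the inner join avoids the ambiguous duplicate column names that a plain join on the key columns would produce. The sketch also assumes each key combination occurs at most once per DataFrame, as it does in the sample data.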
Hope this helps!