Compare two Spark dataframes


Question

Spark dataframe 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |236 |431|169    |
|city 2|prod 1 |9/28/2017|358 |975|193    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
+------+-------+---------+----+---+-------+

Spark dataframe 2:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 1|prod 1 |9/29/2017|358 |975|193    |
|city 1|prod 2 |8/25/2017|50  |687|201    |
|city 1|prod 3 |9/9/2017 |230 |430|160    |
|city 1|prod 4 |9/27/2017|350 |90 |190    |
|city 2|prod 2 |8/24/2017|50  |687|201    |
|city 3|prod 3 |9/8/2017 |236 |431|169    |
|city 3|prod 4 |9/18/2017|230 |431|169    |
+------+-------+---------+----+---+-------+

Given dataframe 1 and dataframe 2 above, please produce a Spark dataframe for each of the following:

  1. Deleted Records
  2. New Records
  3. Records with no changes
  4. Records with changes

The comparison key is the combination of 'city', 'product', and 'date'.

We need a solution that does not use Spark SQL.

Recommended answer

I am not sure about finding the deleted and modified records, but you can use the except function to get the difference:

df2.except(df1)

This returns the rows of dataframe 2 that were added or modified, that is, every row that does not appear identically in dataframe 1. Output:

+------+-------+---------+----+---+-------+
|  city|product|     date|sale|exp|wastage|
+------+-------+---------+----+---+-------+
|city 3| prod 4|9/18/2017| 230|431|    169|
|city 1| prod 4|9/27/2017| 350| 90|    190|
|city 1| prod 3|9/9/2017 | 230|430|    160|
+------+-------+---------+----+---+-------+
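Note that except compares whole rows, so it cannot tell a brand-new key from a modified row. One way to isolate the deleted and new records is a left_anti join on the key columns. The following is only a minimal sketch, not part of the original answer: the object name KeyDiff, the helper names deleted and added, the local SparkSession, and the column types (strings for the key, integers for the values) are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; names are illustrative, not from the original answer.
object KeyDiff {
  // Comparison key named in the question.
  val keys = Seq("city", "product", "date")

  // Rows of `before` whose key does not occur in `after`: deleted records.
  def deleted(before: DataFrame, after: DataFrame): DataFrame =
    before.join(after, keys, "left_anti")

  // Rows of `after` whose key does not occur in `before`: new records.
  def added(before: DataFrame, after: DataFrame): DataFrame =
    after.join(before, keys, "left_anti")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("key-diff").getOrCreate()
    import spark.implicits._

    val cols = Seq("city", "product", "date", "sale", "exp", "wastage")

    // Sample data retyped from the two tables in the question.
    val df1 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 236, 431, 169),
      ("city 2", "prod 1", "9/28/2017", 358, 975, 193),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169)
    ).toDF(cols: _*)

    val df2 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 2", "8/25/2017", 50, 687, 201),
      ("city 1", "prod 3", "9/9/2017", 230, 430, 160),
      ("city 1", "prod 4", "9/27/2017", 350, 90, 190),
      ("city 2", "prod 2", "8/24/2017", 50, 687, 201),
      ("city 3", "prod 3", "9/8/2017", 236, 431, 169),
      ("city 3", "prod 4", "9/18/2017", 230, 431, 169)
    ).toDF(cols: _*)

    deleted(df1, df2).show(false)
    added(df1, df2).show(false)

    spark.stop()
  }
}
```

With the sample data, deleted should hold only the (city 2, prod 1, 9/28/2017) row and added the two prod 4 rows. The modified (city 1, prod 3) row appears in neither, because its key exists in both dataframes.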

You can also join the two dataframes on the key columns and then filter the result to separate the changed rows from the unchanged ones:

df1.join(df2, Seq("city","product", "date"), "left").show(false)
df1.join(df2, Seq("city","product", "date"), "right").show(false)
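The joins above leave the filtering step open. As one possible way to fill it in, the sketch below joins on the key, keeps both versions of the value columns side by side, and compares them. It is an assumption-laden illustration rather than the answerer's method: the object name ChangeSplit, the helper split, the old_ column prefix, and the trimmed two-row sample are all hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object; names are illustrative, not from the original answer.
object ChangeSplit {
  val keys = Seq("city", "product", "date")
  val values = Seq("sale", "exp", "wastage")

  /** Returns (unchanged, changed) rows of df2 for keys present in both dataframes. */
  def split(df1: DataFrame, df2: DataFrame): (DataFrame, DataFrame) = {
    // Rename df1's value columns so they do not clash with df2's after the join.
    val old = values.foldLeft(df1)((df, c) => df.withColumnRenamed(c, s"old_$c"))

    // Inner join: only keys that exist on both sides can be changed or unchanged.
    val joined = df2.join(old, keys, "inner")

    // True only when every value column matches; <=> is null-safe equality.
    val same: Column = values.map(c => col(c) <=> col(s"old_$c")).reduce(_ && _)

    // Keep the original column layout, using df2's values.
    val out = (keys ++ values).map(col)
    (joined.filter(same).select(out: _*), joined.filter(!same).select(out: _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("change-split").getOrCreate()
    import spark.implicits._

    val cols = Seq("city", "product", "date", "sale", "exp", "wastage")

    // Trimmed subset of the question's rows: one unchanged key, one changed key.
    val df1 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 3", "9/9/2017", 236, 431, 169)
    ).toDF(cols: _*)
    val df2 = Seq(
      ("city 1", "prod 1", "9/29/2017", 358, 975, 193),
      ("city 1", "prod 3", "9/9/2017", 230, 430, 160)
    ).toDF(cols: _*)

    val (unchanged, changed) = split(df1, df2)
    unchanged.show(false)
    changed.show(false)

    spark.stop()
  }
}
```

The inner join is a deliberate choice here: keys found on only one side are the deleted or new records, so they are excluded before the value columns are compared.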

Hope this helps!
