Spark: remove duplicated rows with different values but keep only one row for distinctive row


Problem Description

I have a dataset ds like this:

ds.show():

+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1  |1  |2  |tom  |
|1  |1  |2  |tim  |
|1  |3  |2  |tom  |
|1  |3  |2  |tom  |
|2  |1  |2  |mary |
+---+---+---+-----+

I want to remove all duplicate rows (i.e. row 1 and row 2) for the given keys (id1, id2, id3), but at the same time keep only one row for duplicated rows with the same value (i.e. row 3 and row 4). The expected output is:

+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1  |3  |2  |tom  |
|2  |1  |2  |mary |
+---+---+---+-----+

Here I should remove row 1 and row 2 because there are two different values for that key group, but keep a single row for rows 3 and 4 because the value is the same (instead of removing both rows).

I tried using:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an active SparkSession named `spark`, e.g. in spark-shell

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)

Here is a related question: Spark: remove all duplicated rows

But it's not working as expected, because it removes all the duplicated rows.
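
To make the failure mode concrete, here is a minimal diagnostic sketch (not from the original question) that materializes the window count before filtering, using the same df and window as above. count("value") counts rows per key group, not distinct values, so the (1, 3, 2) group also gets a count of 2 and is dropped along with the genuinely ambiguous (1, 1, 2) group:

// Sketch: inspect the window count that the filter sees.
// Both (1, 1, 2) and (1, 3, 2) get count = 2, so the filter drops both groups.
df.withColumn("count", count("value").over(window))
  .orderBy("id1", "id2", "id3")
  .show(false)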

The reason I want to do this is to join with another dataset, and to avoid adding information from this dataset when there are multiple names for the same key group.

Recommended Answer

You can drop duplicates before applying the window count, which leaves a single record, as shown below:

// Drop exact duplicates first so rows 3 and 4 collapse into one row;
// the window count for that key group then becomes 1 and it passes the filter.
df.dropDuplicates()
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)

You can also specify the fields to be checked for duplicates:

df.dropDuplicates("id1", "id2", "id3", "value")
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)

Output:

+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1  |3  |2  |tom  |
|2  |1  |2  |mary |
+---+---+---+-----+
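
As a hedged alternative (not part of the original answer), you could count distinct values per key group directly instead of relying on dropDuplicates before the window count. This is only a sketch, reusing the df and window definitions from above:

// Alternative sketch: keep only key groups with exactly one distinct value,
// then collapse the remaining identical rows to a single row.
df.withColumn("distinctValues", size(collect_set("value").over(window)))
  .filter($"distinctValues" === 1)
  .drop("distinctValues")
  .dropDuplicates("id1", "id2", "id3", "value")
  .show(false)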

