Spark remove duplicate rows from DataFrame
Problem Description
Assume that I have a DataFrame like:
val json = sc.parallelize(Seq(
  """{"a":1, "b":2, "c":22, "d":34}""",
  """{"a":3, "b":9, "c":22, "d":12}""",
  """{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)
I want to remove duplicate rows for column "a" based on the value of column "b"; i.e., if there are duplicate rows for column "a", I want to keep the one with the larger value for "b". For the above example, after processing, I need only
{"a":3, "b":9, "c":22, "d":12}
{"a":3, "b":9, "c":22, "d":12}
and
{"a":1, "b":4, "c":23, "d":12}
{"a":1, "b":4, "c":23, "d":12}
The Spark DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach I could do a map().reduceByKey(), but what DataFrame-specific operation is there to do this?
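For reference, the RDD route alluded to above might look like the following sketch (not from the original post; it assumes the df defined earlier and that the JSON integers are read in as Long):

// Sketch of the map().reduceByKey() idea: key rows by "a", keep the row with the larger "b".
val dedupedRdd = df.rdd
  .map(row => (row.getAs[Long]("a"), row))
  .reduceByKey((r1, r2) =>
    if (r1.getAs[Long]("b") >= r2.getAs[Long]("b")) r1 else r2)
  .values
// Rebuild a DataFrame from the surviving rows, reusing the original schema.
val dedupedDf = sqlContext.createDataFrame(dedupedRdd, df.schema)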
Would appreciate some help, thanks.
Recommended Answer
You can use a window function in Spark SQL to achieve this.
df.registerTempTable("x")
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect
This will achieve what you need. Read more about window function support at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html