Spark remove duplicate rows from DataFrame

Question

Suppose I have a DataFrame like:

val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

I want to remove duplicate rows for column "a" based on the value of column "b". That is, if there are duplicate rows for column "a", I want to keep the one with the larger value for "b". For the above example, after processing, I need only:

{"a":3, "b":9, "c":22, "d":12}

{"a":1, "b":4, "c":23, "d":12}

Spark DataFrame's dropDuplicates API doesn't seem to support this. With the RDD approach I can do a map().reduceByKey(), but is there a DataFrame-specific operation for this?
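
For reference, the map().reduceByKey() workaround I have in mind looks roughly like this (just a sketch; it assumes the JSON integers were inferred as LongType):

// Key each Row by "a", keep the Row with the larger "b" via reduceByKey,
// then rebuild a DataFrame with the original schema.
val deduped = df.rdd
  .map(row => (row.getLong(row.fieldIndex("a")), row))
  .reduceByKey((r1, r2) =>
    if (r1.getLong(r1.fieldIndex("b")) >= r2.getLong(r2.fieldIndex("b"))) r1 else r2)
  .values

sqlContext.createDataFrame(deduped, df.schema).show()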

Appreciate some help, thanks.

Answer

You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")

// Rank rows within each "a" partition by "b" descending; keep only the top-ranked row
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect

This will achieve what you need. Read more about window function support at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
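
If you'd rather stay in the DataFrame API, the same logic can be written with the window-function API (a minimal sketch, assuming Spark 1.6+ where row_number is available in org.apache.spark.sql.functions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Number the rows within each "a" partition by "b" descending,
// then keep only the first row of each partition.
val w = Window.partitionBy("a").orderBy(desc("b"))

df.withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
  .show()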
