Spark remove duplicate rows from DataFrame
Question
Suppose I have a DataFrame like:
val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)
I want to remove duplicate rows for column "a" based on the value of column "b". i.e, if there are duplicate rows for column "a", I want to keep the one with larger value for "b". For the above example, after processing, I need only
{"a":3, "b":9, "c":22, "d":12}
and
{"a":1, "b":4, "c":23, "d":12}
Spark DataFrame's dropDuplicates API doesn't seem to support this. With the RDD approach, I can do a map().reduceByKey(), but what DataFrame-specific operation is there to do this?
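(For reference, the RDD approach mentioned above could be sketched like this; a minimal sketch that assumes the JSON numbers are read as Long and keys each row by "a", keeping the row with the larger "b":)

```scala
import org.apache.spark.sql.Row

// Key each row by "a", keep the row whose "b" is larger,
// then rebuild a DataFrame from the surviving rows.
val deduped = df.rdd
  .map(row => (row.getAs[Long]("a"), row))
  .reduceByKey((r1, r2) => if (r1.getAs[Long]("b") >= r2.getAs[Long]("b")) r1 else r2)
  .values
val result = sqlContext.createDataFrame(deduped, df.schema)
```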
Appreciate some help, thanks.
Answer
You can use a window function in Spark SQL to achieve this.
df.registerTempTable("x")
sqlContext.sql("""
  SELECT a, b, c, d
  FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y
  WHERE rn = 1
""").collect
This will get you what you need.
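The same logic can also be expressed with the DataFrame API instead of a SQL string (a sketch; assumes Spark 1.6+, where functions.row_number is available):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Partition by "a", order each partition by "b" descending,
// and keep only the first row of each partition.
val w = Window.partitionBy("a").orderBy($"b".desc)
val result = df.withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")
```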
Read more about window function support at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html