How to drop duplicates using conditions
Question
I have the following DataFrame df:
How can I delete the duplicates, keeping the minimum value of level for each duplicated pair of item_id and country_id?
+-----------+----------+---------------+
|item_id |country_id|level |
+-----------+----------+---------------+
| 312330| 13535670| 82|
| 312330| 13535670| 369|
| 312330| 13535670| 376|
| 319840| 69731210| 127|
| 319840| 69730600| 526|
| 311480| 69628930| 150|
| 311480| 69628930| 138|
| 311480| 69628930| 405|
+-----------+----------+---------------+
Expected output:
+-----------+----------+---------------+
|item_id |country_id|level |
+-----------+----------+---------------+
| 312330| 13535670| 82|
| 319840| 69731210| 127|
| 319840| 69730600| 526|
| 311480| 69628930| 138|
+-----------+----------+---------------+
I know how to delete duplicates without conditions using dropDuplicates, but I don't know how to do it for my particular case.
Answer
One approach is to use orderBy (ascending by default), groupBy, and the first aggregation:
import org.apache.spark.sql.functions.first

df.orderBy("level")
  .groupBy("item_id", "country_id")
  .agg(first("level").as("level"))
  .show(false)
You can also define the order explicitly, using .asc for ascending and .desc for descending, as below:
df.orderBy($"level".asc).groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)
You can also do the operation using a window and the row_number function, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each (item_id, country_id) partition by ascending level
val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)

// Keep only the top-ranked row per partition, i.e. the one with the minimum level
df.withColumn("rank", row_number().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank")
  .show()