How to drop duplicates using conditions

Published: 2021/11/14 22:00:51 | Tags: scala, apache-spark, apache-spark-sql


Question

I have the following DataFrame df. How can I drop duplicates while keeping, for each duplicated pair of item_id and country_id, only the row with the minimum level?

+-----------+----------+---------------+                                        
|item_id    |country_id|level          |
+-----------+----------+---------------+
|     312330|  13535670|             82|
|     312330|  13535670|            369|
|     312330|  13535670|            376|
|     319840|  69731210|            127|
|     319840|  69730600|            526|
|     311480|  69628930|            150|
|     311480|  69628930|            138|
|     311480|  69628930|            405|
+-----------+----------+---------------+

Expected output:

+-----------+----------+---------------+                                        
|item_id    |country_id|level          |
+-----------+----------+---------------+
|     312330|  13535670|             82|
|     319840|  69731210|            127|
|     319840|  69730600|            526|
|     311480|  69628930|            138|
+-----------+----------+---------------+

I know how to drop duplicates unconditionally using dropDuplicates, but I don't know how to do it for this particular case.

Recommended answer

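One approach is to use orderBy (ascending by default), groupBy, and the first aggregation. Be aware that first depends on row order, which Spark does not guarantee after the shuffle that groupBy triggers, so on multi-partition data the result may not always be the minimum level.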

import org.apache.spark.sql.functions.first
// Sort by level, then take the first level seen for each (item_id, country_id) pair
df.orderBy("level").groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)

You can also state the sort order explicitly, using .asc for ascending and .desc for descending:

import spark.implicits._ // enables the $"col" syntax; assumes the SparkSession is named spark
df.orderBy($"level".asc).groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)
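If only the minimum level per pair is needed, the min aggregate states that directly and does not rely on row order surviving the groupBy. The following is a minimal sketch, not part of the original answer. It assumes a local SparkSession (created here only so the snippet is self-contained) and rebuilds df from the rows shown in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

// Assumption: a local session, created here only to make the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("min-per-group").getOrCreate()
import spark.implicits._

// Sample rows copied from the question
val df = Seq(
  (312330, 13535670, 82),
  (312330, 13535670, 369),
  (312330, 13535670, 376),
  (319840, 69731210, 127),
  (319840, 69730600, 526),
  (311480, 69628930, 150),
  (311480, 69628930, 138),
  (311480, 69628930, 405)
).toDF("item_id", "country_id", "level")

// One row per (item_id, country_id) pair, carrying the smallest level in that group
val deduped = df.groupBy("item_id", "country_id").agg(min("level").as("level"))
deduped.show(false)
```

This only works cleanly when the grouping keys and the aggregated column are the only columns you need. Any other column would have to be aggregated separately, which is where a window function becomes the better fit.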

You can also do this with a window and the row_number function:

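import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Number the rows within each (item_id, country_id) group, lowest level first
val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)
// Keep only the first row of each group, then drop the helper column
df.withColumn("rank", row_number().over(windowSpec)).filter($"rank" === 1).drop("rank").show()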
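One practical difference of the window approach is that it keeps whole rows, so any additional columns travel with the selected row. The sketch below illustrates this under stated assumptions: the local SparkSession is created only for self-containment, the first three columns reuse a few rows from the question, and the fourth column, source, is a hypothetical addition that does not appear in the original data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session, created here only to make the sketch self-contained
val spark = SparkSession.builder().master("local[*]").appName("window-dedup").getOrCreate()
import spark.implicits._

// item_id, country_id and level come from the question;
// "source" is a hypothetical extra column added for illustration
val df = Seq(
  (312330, 13535670, 82, "a"),
  (312330, 13535670, 369, "b"),
  (311480, 69628930, 150, "c"),
  (311480, 69628930, 138, "d")
).toDF("item_id", "country_id", "level", "source")

// Rank rows inside each (item_id, country_id) group by ascending level
val byPair = Window.partitionBy("item_id", "country_id").orderBy(col("level").asc)

// Keep the lowest-level row of each group, including its "source" value
val kept = df
  .withColumn("rn", row_number().over(byPair))
  .filter(col("rn") === 1)
  .drop("rn")
kept.show(false)
```

If two rows in a group share the same minimum level, row_number picks one of them arbitrarily. Adding a tie-breaking column to the orderBy makes the choice deterministic.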
