Remove all rows that are duplicates with respect to some rows

Views: 30 Published: 2021/11/14 21:47:51 Tags: python pyspark apache-spark-sql pyspark-sql


Problem description

I've seen a couple of questions like this, but no satisfactory answer for my situation. Here is a sample DataFrame:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.

In this case:

Rows 1 and 2 are duplicates (again, we are ignoring the "type" column)
Rows 3 and 4 are duplicates, and therefore only rows 5 and 6 should remain:

The output would be:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+

I've tried

df.dropDuplicates(subset = ['id', 'value'], keep = False)

But the "keep" feature isn't in PySpark (as it is in pandas.DataFrame.drop_duplicates).

How else could I do this?
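For reference, the pandas behavior the question wants to reproduce can be sketched as follows. This is only an illustration of `keep=False` on the sample data; the names `pdf` and `result` are made up here and do not appear in the original post.

```python
import pandas as pd

# Sample data from the question, rebuilt as a pandas DataFrame.
pdf = pd.DataFrame({
    "id": [283924, 283924, 982384, 982384, 892383, 892383],
    "value": [1.5, 1.5, 3.0, 3.0, 2.0, 2.5],
    "type": [0, 1, 0, 1, 0, 1],
})

# keep=False drops every row whose (id, value) pair occurs more than once,
# rather than keeping the first or last occurrence.
result = pdf.drop_duplicates(subset=["id", "value"], keep=False)
print(result)
```

Only the two rows with id 892383 survive, because their "value" entries differ.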

Recommended answer

You can do that using window functions:

from pyspark.sql import Window, functions as F

# Count the rows in each (id, value) group, then keep only groups of size 1.
df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
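To see why filtering on a per-group count of 1 removes every instance of a duplicate, the same logic can be sketched in plain Python without Spark. This is an illustration only: `rows` and `drop_all_duplicates` are hypothetical names, and the tuples simply mirror the sample DataFrame from the question.

```python
from collections import Counter

# Sample rows from the question as (id, value, type) tuples.
rows = [
    (283924, 1.5, 0),
    (283924, 1.5, 1),
    (982384, 3.0, 0),
    (982384, 3.0, 1),
    (892383, 2.0, 0),
    (892383, 2.5, 1),
]

def drop_all_duplicates(rows):
    """Keep only rows whose (id, value) pair occurs exactly once."""
    # Plays the role of count(...) over a window partitioned by (id, value).
    counts = Counter((row[0], row[1]) for row in rows)
    # Plays the role of where("fg = 1"): every member of a repeated group is dropped.
    return [row for row in rows if counts[(row[0], row[1])] == 1]

unique_rows = drop_all_duplicates(rows)
print(unique_rows)
```

Because the count is attached to every row of a group, a group of size two fails the filter for both of its rows, which is what distinguishes this from dropDuplicates, where one representative per group is kept.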
