Drop consecutive duplicates in a pyspark dataframe

Views: 24 Published: 2021/11/14 21:48:28 Tags: apache-spark pyspark pyspark-sql

Question

I have a dataframe like this:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  3|2.0|
## |  3|1.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I want to remove the consecutive repetitions of id, keeping only the first row of each run, and obtain:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I found ways of doing this in Pandas, but nothing in PySpark.
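For context, the pandas approach alluded to here is usually a shift-and-compare on the id column. The following is only a minimal sketch, assuming pandas is installed. It is not taken from the original question, and the variable names `df` and `deduped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [2, 3, 3, 3, 2, 4],
    "num": [3.0, 6.0, 2.0, 1.0, 9.0, 7.0],
})

# shift() moves the id column down by one row, so each id is compared with
# the id of the row directly above it. The first row is compared with NaN,
# which never equals a real id, so it is always kept.
deduped = df[df["id"] != df["id"].shift()].reset_index(drop=True)
print(deduped)
```

This relies on the row order of the pandas frame, which is exactly what a Spark dataframe does not guarantee without an explicit ordering column.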

Recommended answer

This should work as you want, although there may be room for some optimization:

from pyspark.sql.functions import col, lag, monotonically_increasing_id, when
from pyspark.sql.window import Window as W

# assumes an active SparkSession named `spark`
test_df = spark.createDataFrame([
    (2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)
    ], ("id", "num"))
# create a temporary ID because the window needs a column to order by
test_df = test_df.withColumn("idx", monotonically_increasing_id())
w = W.orderBy("idx")  # no partitionBy, so all rows are processed in a single partition
# True when the previous row has a different id, or when there is no previous row
id_changed = when(lag("id", 1).over(w) == col("id"), False).otherwise(True)

# keep only the rows whose id differs from the previous row's id
test_df.withColumn("changed", id_changed).filter(col("changed")).select("id", "num").show()

Output:

+---+---+
| id|num|
+---+---+
|  2|3.0|
|  3|6.0|
|  2|9.0|
|  4|7.0|
+---+---+
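The rule the window expression applies is simple: a row survives when its id differs from the id of the row before it. A plain-Python reference of that rule, with no Spark involved, can be handy for checking the expected result on a small sample. This is only a sketch; the helper name `drop_consecutive_duplicates` is hypothetical and not part of PySpark.

```python
def drop_consecutive_duplicates(rows, key):
    """Keep each row whose key differs from the key of the preceding row."""
    kept = []
    # Sentinel that never equals a real key. It plays the role of the null
    # that lag() yields for the first row, so the first row is always kept.
    previous = object()
    for row in rows:
        current = key(row)
        if current != previous:
            kept.append(row)
        previous = current
    return kept


rows = [(2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)]
result = drop_consecutive_duplicates(rows, key=lambda r: r[0])
print(result)
```

Like the Spark version, this keeps the first row of every run and lets an id reappear later (id 2 shows up twice in the result), which is what distinguishes it from an ordinary distinct or dropDuplicates.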
