Pyspark - remove duplicates from dataframe keeping the last appearance


Problem Description

I'm trying to dedupe a Spark dataframe, keeping only the latest appearance of each row. The duplicates are identified by three variables:

NAME
ID
DOB

I succeeded in pandas with the following:

df_dedupe = df.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
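For context, here is that call as a minimal runnable sketch, using the sample rows that appear later in this post (the two "Bob" rows are duplicates on NAME/ID/DOB, and `keep='last'` retains the second one):

```python
import pandas as pd

# Sample data borrowed from later in the post.
df = pd.DataFrame(
    [('Bob', '10', '1542189668', '0', '0'),
     ('Alice', '10', '1425298030', '154', '39'),
     ('Bob', '10', '1542189668', '178', '42')],
    columns=['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])

# keep='last' drops earlier occurrences of each (NAME, ID, DOB) group.
df_dedupe = df.drop_duplicates(subset=['NAME', 'ID', 'DOB'], keep='last', inplace=False)
print(df_dedupe)
```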

But in Spark I tried the following:

df_dedupe = df.dropDuplicates(['NAME', 'ID', 'DOB'], keep='last')

And I get this error:

TypeError: dropDuplicates() got an unexpected keyword argument 'keep'

Any ideas?

Answer

Thanks for your help. I followed your directions, but the outcome was not as expected:

d1 = [('Bob', '10', '1542189668', '0', '0'),
      ('Alice', '10', '1425298030', '154', '39'),
      ('Bob', '10', '1542189668', '178', '42')]
df1 = spark.createDataFrame(d1, ['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])
df_dedupe = df1.dropDuplicates(['NAME', 'ID', 'DOB'])  # keeps an arbitrary row per group
df_reverse = df1.sort(["NAME", "ID", "DOB"], ascending=False)
# Note: the join result is never assigned, so the next line shows
# the un-joined df_dedupe.
df_dedupe.join(df_reverse, ['NAME', 'ID', 'DOB'], 'inner')
df_dedupe.show(100, False)

The result was:

+-----+---+----------+------+--------+    
|NAME |ID |DOB       |Height|ShoeSize|
+-----+---+----------+------+--------+
|Bob  |10 |1542189668|0     |0       |
|Alice|10 |1425298030|154   |39      |
+-----+---+----------+------+--------+

It shows the "Bob" row with the corrupted data.

Finally, I changed my approach: I converted the DF to pandas and then back to Spark:

from pyspark.sql.types import StructType, StructField, StringType

p_schema = StructType([
    StructField('NAME', StringType(), True),
    StructField('ID', StringType(), True),
    StructField('DOB', StringType(), True),
    StructField('Height', StringType(), True),
    StructField('ShoeSize', StringType(), True)])
d1 = [('Bob', '10', '1542189668', '0', '0'),
      ('Alice', '10', '1425298030', '154', '39'),
      ('Bob', '10', '1542189668', '178', '42')]
df = spark.createDataFrame(d1, p_schema)
pdf = df.toPandas()
df_dedupe = pdf.drop_duplicates(subset=['NAME', 'ID', 'DOB'], keep='last', inplace=False)

df_spark = spark.createDataFrame(df_dedupe, p_schema)
df_spark.show(100, False)

This finally brought back the correct "Bob":

+-----+---+----------+------+--------+
|NAME |ID |DOB       |Height|ShoeSize|
+-----+---+----------+------+--------+
|Alice|10 |1425298030|154   |39      |
|Bob  |10 |1542189668|178   |42      |
+-----+---+----------+------+--------+

Of course, I'd still like to have a purely Spark solution, but the lack of row indexing seems to be the problem with Spark.

Thanks!

