Remove duplicates from a dataframe in PySpark

Views: 31 Published: 2021/12/22 21:23:26 Tags: python apache-spark pyspark duplicates pyspark-dataframes


Problem description

I'm experimenting with dataframes in PySpark 1.4 locally and can't get the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

I'm not sure why, since I seem to be following the syntax in the latest documentation.

#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()

Recommended answer

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, but once you apply .collect() to it, the result is a plain Python list, and lists don't provide a dropDuplicates method. Call dropDuplicates() on the DataFrame first and collect only afterwards. What you want is something like this:

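#build the dataframe and drop duplicates before collecting anything
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())

#collect() now returns the de-duplicated rows as a Python list
df1.collect()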
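To make the type issue concrete, here is a minimal plain-Python sketch (no Spark required). The sample tuples are hypothetical stand-ins for what .collect() hands back, not data from the question. It shows that a list really has no dropDuplicates attribute, and how rows that have already been collected could be de-duplicated on the driver. This is only an illustration for small data; it is not a replacement for calling dropDuplicates() on the DataFrame itself.

```python
# Hypothetical sample rows standing in for the list returned by .collect().
rows = [
    ("a", "1", "x", "y"),
    ("b", "2", "x", "y"),
    ("a", "1", "x", "y"),  # exact duplicate of the first row
]

# A plain list has no dropDuplicates attribute, which is what the
# AttributeError in the question reports.
print(type(rows).__name__)
print(hasattr(rows, "dropDuplicates"))

# Order-preserving de-duplication of hashable rows in plain Python:
# dict keys are unique and keep insertion order.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```

Checking type(df1) right after each step is a quick way to spot where a DataFrame has turned into a list.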
