remove duplicates from a dataframe in pyspark
Question
I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the drop duplicates method to work. Keeps returning the error "AttributeError: 'list' object has no attribute 'dropDuplicates'". Not quite sure why as I seem to be following the syntax in the latest documentation. Seems like I am missing an import for that functionality or something.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
Answer
It is not an import problem. You simply call .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())
df1.collect()
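For reference, here is a minimal self-contained sketch of the same fix, assuming a running sqlContext as in the question; the sample rows are illustrative stand-ins for the CSV data. It makes the type difference visible: dropDuplicates() is available while df is still a DataFrame, and stops being available once .collect() turns it into a plain list:

#illustrative stand-in rows for the CSV data (one duplicate on purpose)
rows = [('a', '1', 'x', 'y'), ('a', '1', 'x', 'y'), ('b', '2', 'x', 'z')]
df = sqlContext.createDataFrame(rows, ['column1', 'column2', 'column3', 'column4'])

#still a DataFrame here, so dropDuplicates() works
print(type(df))              # <class 'pyspark.sql.dataframe.DataFrame'>
df.dropDuplicates().show()   # the duplicated ('a', '1', 'x', 'y') row appears only once

#.collect() materializes the rows into a plain Python list of Row objects,
#which has no dropDuplicates method
collected = df.collect()
print(type(collected))       # <class 'list'>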