pickle.PicklingError: args[0] from __newobj__ args has the wrong class, with hadoop python
Problem description
I am trying to delete stop words via Spark; the code is as follows:
from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
word_list=["ourselves","out","over", "own", "same" ,"shan't" ,"she", "she'd", "what", "the", "fuck", "is", "this","world","too","who","who's","whom","yours","yourself","yourselves"]
wordlist=spark.createDataFrame([word_list]).rdd
def stopwords_delete(word_list):
    filtered_words = []
    print word_list
    for word in word_list:
        print word
        if word not in stopwords.words('english'):
            filtered_words.append(word)
filtered_words=wordlist.map(stopwords_delete)
print(filtered_words)
and I got the following error:
pickle.PicklingError: args[0] from __newobj__ args has the wrong class
I don't know why; can somebody help me?
Thanks in advance.
You are using map over an RDD which has only one row, with each word as a column. So the entire Row object is passed to the stopwords_delete function, and the for loop within it tries to match that Row against the stopwords, which fails. Try it like this:
filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)
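To see why the flattening step matters, here is a plain-Python sketch of what flatMap(lambda x: x) does to the one-row dataset (plain lists stand in for the RDD and its Row, so no Spark is needed to run it):

```python
# The one-row dataset: a single "row" whose columns are the words.
rows = [["ourselves", "out", "shan't", "she'd", "world"]]

# wordlist.map(f) would call f once per row, handing it the whole Row --
# which is what made stopwords_delete choke on a Row object.
# flatMap(lambda x: x) instead unpacks each row into its elements:
flattened = [word for row in rows for word in row]
print(flattened)  # one flat list of words, ready for stopword filtering
```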
I got this output as filtered_words,
["shan't", "she'd", 'fuck', 'world', "who's"]
Also, include a return in your function.
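For illustration, a minimal sketch of the function with the missing return added; a small hard-coded set stands in for nltk's stopwords.words('english') so the snippet runs without NLTK:

```python
# Hypothetical stand-in for stopwords.words('english') -- a few entries only.
STOPWORDS = {"ourselves", "out", "over", "own", "same", "she",
             "what", "the", "is", "this", "too", "who", "whom",
             "yours", "yourself", "yourselves"}

def stopwords_delete(word_list):
    filtered_words = []
    for word in word_list:
        if word not in STOPWORDS:
            filtered_words.append(word)
    return filtered_words  # the return the original function was missing

print(stopwords_delete(["shan't", "she'd", "the", "fuck", "world"]))
# -> ["shan't", "she'd", 'fuck', 'world']
```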
Another way: you could use a list comprehension to replace the stopwords_delete function,
filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()