pickle.PicklingError: args[0] from __newobj__ args has the wrong class (hadoop / python)


Problem description

I am trying to delete stop words via Spark; the code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)
word_list=["ourselves","out","over", "own", "same" ,"shan't" ,"she", "she'd", "what", "the", "fuck", "is", "this","world","too","who","who's","whom","yours","yourself","yourselves"]

wordlist=spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words=[]
    print word_list



    for word in word_list:
        print word
        if word not in stopwords.words('english'):
            filtered_words.append(word)



filtered_words=wordlist.map(stopwords_delete)
print(filtered_words)

and I got the following error:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class

I don't know why; can somebody help me?
Thanks in advance.

Solution

You are using map over an RDD which has only one row, with each word as a column. So the entire row of the RDD (which is of type Row) is passed to the stopwords_delete function, and the for loop inside it tries to match that Row against the stopwords, which fails. Try like this:

filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)

I got this output as filtered_words:

["shan't", "she'd", 'fuck', 'world', "who's"]

Also, include a return in your function.
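
For reference, a minimal sketch of the function with the return added (only the return statement and the comments are additions; everything else follows the question's code):

from nltk.corpus import stopwords

def stopwords_delete(word_list):
    filtered_words = []
    for word in word_list:
        # keep the word only if it is not an English stopword
        if word not in stopwords.words('english'):
            filtered_words.append(word)
    return filtered_words  # without this return, the caller receives None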

Another way: you could use a list comprehension to replace the stopwords_delete function,

filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()
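
Putting it together, a minimal end-to-end sketch of that list-comprehension approach (it assumes a local Spark session and that the NLTK stopwords corpus has already been downloaded, as in the question):

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

word_list = ["ourselves", "out", "over", "own", "same", "shan't", "she", "she'd",
             "what", "the", "fuck", "is", "this", "world", "too", "who", "who's",
             "whom", "yours", "yourself", "yourselves"]
wordlist = spark.createDataFrame([word_list]).rdd

# flatten the single Row into individual words, then drop the English stopwords
filtered_words = wordlist.flatMap(
    lambda x: [i for i in x if i not in stopwords.words('english')]
).collect()
print(filtered_words)  # expected: ["shan't", "she'd", 'fuck', 'world', "who's"]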
