Pyspark py4j PickleException: "expected zero arguments for construction of ClassDict"
Problem description
This question is directed towards persons familiar with py4j - and can help to resolve a pickling error. I am trying to add a method to the pyspark PythonMLLibAPI that accepts an RDD of a namedtuple, does some work, and returns a result in the form of an RDD.
This method is modeled after the PythonMLLibAPI.trainALSModel() method, whose analogous existing relevant portions are:
def trainALSModel(
    ratingsJRDD: JavaRDD[Rating],
    .. )
The existing python Rating class used to model the new code is:
class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    def __reduce__(self):
        return Rating, (int(self.user), int(self.product), float(self.rating))
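For reference, __reduce__ tells pickle to rebuild the object by calling the returned callable with the returned argument tuple. A quick round trip in plain Python (a sketch outside Spark, repeating the class above) shows the mechanism:

```python
import pickle
from collections import namedtuple

class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    def __reduce__(self):
        # pickle will call Rating(int(user), int(product), float(rating))
        # when deserializing, rather than restoring instance state.
        return Rating, (int(self.user), int(self.product), float(self.rating))

r = pickle.loads(pickle.dumps(Rating(1, 2, 4.5)))
print(r)  # Rating(user=1, product=2, rating=4.5)
```

The ClassDict error below arises on the JVM side, where it is Pyrolite rather than CPython that replays this reduce tuple.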
Here is the attempt. The relevant classes are:
New Python class pyspark.mllib.clustering.MatrixEntry:
from collections import namedtuple

class MatrixEntry(namedtuple("MatrixEntry", ["x", "y", "weight"])):
    def __reduce__(self):
        return MatrixEntry, (long(self.x), long(self.y), float(self.weight))
New method foobarRdd in PythonMLLibAPI:
def foobarRdd(
    data: JavaRDD[MatrixEntry]): RDD[FooBarResult] = {
  val rdd = data.rdd.map { d => FooBarResult(d.i, d.j, d.value, d.i * 100 + d.j * 10 + d.value) }
  rdd
}
Now let us try it out:
from pyspark.mllib.clustering import MatrixEntry
def convert_to_MatrixEntry(tuple):
    return MatrixEntry(*tuple)
from pyspark.mllib.clustering import *
pic = PowerIterationClusteringModel(2)
tups = [(1,2,3),(4,5,6),(12,13,14),(15,7,8),(16,17,16.5)]
trdd = sc.parallelize(map(convert_to_MatrixEntry,tups))
# print out the RDD on python side just for validation
print "%s" %(repr(trdd.collect()))
from pyspark.mllib.common import callMLlibFunc
pic = callMLlibFunc("foobar", trdd)
Relevant portions of results:
[(1,2)=3.0, (4,5)=6.0, (12,13)=14.0, (15,7)=8.0, (16,17)=16.5]
which shows the input RDD is 'whole'. However, the pickling was unhappy:
5/04/27 21:15:44 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict
(for pyspark.mllib.clustering.MatrixEntry)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1167)
at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1166)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
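For context (this explanation is not from the original post): the ClassDictConstructor in the trace is Pyrolite's fallback for Python classes the JVM unpickler has no registered constructor for, and, as the message says, it only supports zero-argument construction, so a __reduce__ that rebuilds via constructor arguments fails unless Spark's SerDe registers a pickler for the class. One possible workaround, sketched below with hypothetical usage, is to ship plain tuples across the py4j boundary and rebuild the case class on the Scala side (int is used here in place of the Python 2-only long):

```python
from collections import namedtuple

class MatrixEntry(namedtuple("MatrixEntry", ["x", "y", "weight"])):
    def __reduce__(self):
        return MatrixEntry, (int(self.x), int(self.y), float(self.weight))

def to_plain_tuple(entry):
    # Plain tuples of ints/floats are types the JVM-side unpickler
    # already understands, sidestepping ClassDict construction entirely.
    return (int(entry.x), int(entry.y), float(entry.weight))

# Hypothetical usage inside the snippet above:
# plain_rdd = trdd.map(to_plain_tuple)
# pic = callMLlibFunc("foobar", plain_rdd)
```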
Below is a visual of the Python invocation stack trace:
Recommended answer
I had the same error while using MLlib, and it turned out that I had returned a wrong datatype in one of my functions. It now works after a simple cast on the returned value. This might not be the answer you're seeking, but it is at least a hint for the direction to follow.
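A minimal sketch of what such a cast can look like (the helper name and the numeric-type scenario are hypothetical, not from the answer):

```python
def normalize_result(x, y, weight):
    # Explicit casts to plain built-in types: a numpy.int64 or
    # numpy.float64 slipping through can change how the value is
    # pickled, while int/float serialize predictably across py4j.
    return (int(x), int(y), float(weight))

print(normalize_result(1, 2, 3))  # (1, 2, 3.0)
```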