PySpark serializing the 'self' referenced object in map lambdas?
Question
As far as I understand, when using the Spark Scala interface we have to be careful not to unnecessarily serialize a full object when only one or two attributes are needed (see http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).
How does this work when using PySpark? If I have a class as follows:
class C0(object):

    def func0(arg):
        ...

    def func1(rdd):
        result = rdd.map(lambda x: self.func0(x))
Does this result in pickling the full C0 instance? If yes, what's the correct way to avoid it?
Thanks.
Answer
This does result in pickling of the full C0 instance, according to this documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
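A quick way to see this outside of Spark is to serialize the closure directly with cloudpickle, the serializer PySpark bundles for shipping functions to executors. A minimal sketch (the payload attribute and its size are assumptions for illustration):

import cloudpickle  # the serializer PySpark uses for closures

class C0(object):
    def __init__(self):
        self.payload = b"x" * (10 ** 6)  # hypothetical 1 MB attribute

    def func0(self, arg):
        return arg

    def make_mapper(self):
        return lambda x: self.func0(x)  # the lambda closes over self

plain = cloudpickle.dumps(lambda x: x)
bound = cloudpickle.dumps(C0().make_mapper())
print(len(plain), len(bound))  # the second is ~1 MB: the whole instance rode along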
In order to avoid it, do something like:
class C0(object):

    def func0(self, arg):  # added self
        ...

    def func1(self, rdd):  # added self
        func = self.func0
        result = rdd.map(lambda x: func(x))
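For completeness, here is a runnable sketch of that pattern on a local SparkContext. The concrete func0 body, the return statement, and the local-mode setup are assumptions added so the example produces output:

from pyspark import SparkContext

class C0(object):
    def func0(self, arg):
        return arg * 2  # hypothetical body standing in for the ... above

    def func1(self, rdd):
        func = self.func0                  # bind the method once, locally
        return rdd.map(lambda x: func(x))  # the lambda never names self

sc = SparkContext("local[2]", "closure-demo")
print(C0().func1(sc.parallelize(range(5))).collect())  # [0, 2, 4, 6, 8]
sc.stop()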
Moral of the story: avoid the self keyword anywhere in a map call. Spark can be smart about serializing a single function if it can compute that function in a local closure, but any reference to self forces Spark to serialize your entire object.
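The linked programming guide recommends the same trick for attributes: copy the field into a local variable so the closure captures only that value rather than the whole object. A sketch with hypothetical names (C1, factor, big_state):

class C1(object):
    def __init__(self):
        self.factor = 3                   # the one attribute the map needs
        self.big_state = [0] * (10 ** 6)  # would otherwise ship to executors

    def scale(self, rdd):
        factor = self.factor                  # local copy of the field
        return rdd.map(lambda x: x * factor)  # closure captures an int only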