PySpark serializing the 'self' referenced object in map lambdas?


Question

As far as I understand, while using the Spark Scala interface we have to be careful not to unnecessarily serialize a full object when only one or two attributes are needed (http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).

How does this work when using PySpark? If I have a class as follows:

class C0(object):

  def func0(arg):
    ...

  def func1(rdd):
    result = rdd.map(lambda x: self.func0(x))

Does this result in pickling the full C0 instance? If yes, what's the correct way to avoid it?

Thanks.

Answer

This does result in pickling of the full C0 instance, according to this documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
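
To see why, note that a lambda written inside a method closes over self, and PySpark's pickler (cloudpickle) serializes everything the closure references. Below is a minimal sketch that makes the cost visible by comparing pickled sizes; it assumes the cloudpickle package (which PySpark bundles) and uses hypothetical attributes big and offset:

import cloudpickle

class C0(object):

  def __init__(self):
    self.big = list(range(1000000))  # heavy attribute the closure never uses
    self.offset = 1                  # the only attribute the closure needs

  def make_closures(self):
    via_self = lambda x: x + self.offset  # closes over self: the whole instance travels
    offset = self.offset                  # copy the needed attribute into a local
    via_local = lambda x: x + offset      # closes over a plain int only
    return via_self, via_local

via_self, via_local = C0().make_closures()
print(len(cloudpickle.dumps(via_self)))   # large: the payload includes self.big
print(len(cloudpickle.dumps(via_local)))  # small: just the function and one int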

In order to avoid it, do something like:

class C0(object):

  def func0(self, arg): # added self
    ...

  def func1(self, rdd): # added self
    func = self.func0
    result = rdd.map(lambda x: func(x))

Moral of the story: avoid the self keyword anywhere in a map call. Spark can be smart about serializing a single function if it can compute that function in a local closure, but any reference to self forces Spark to serialize your entire object.
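
One hedged caveat: func in the snippet above is still a bound method, and pickle generally serializes a bound method together with its instance (__self__), so the local alias alone may not stop the whole object from being shipped. If func0 does not actually need instance state, declaring it a staticmethod (or moving it to module level) removes every reference to self. A minimal sketch under that assumption:

class C0(object):

  @staticmethod
  def func0(arg):
    ...  # placeholder body, as in the question; uses no instance state

  def func1(self, rdd):
    func = C0.func0  # a plain function, not a bound method
    result = rdd.map(lambda x: func(x))
    return result

If func0 genuinely needs instance data, the safer pattern from the programming guide linked above is to copy just the needed fields into local variables before the map call, as in the via_local sketch earlier.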
