PySpark serializing the 'self' referenced object in map lambdas?
Question
As far as I understand, when using the Spark Scala interface we have to be careful not to unnecessarily serialize a full object when only one or two attributes are needed (see http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/).
How does this work when using PySpark? If I have a class as follows:
class C0(object):

    def func0(arg):
        ...

    def func1(rdd):
        result = rdd.map(lambda x: self.func0(x))
Does this result in pickling the full C0 instance? If yes, what's the correct way to avoid it?
Thanks.
Answer
This does result in pickling of the full C0 instance, according to this documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
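A quick way to see this outside of Spark is to serialize the closure directly with cloudpickle, the serializer PySpark bundles for shipping functions to executors. A minimal sketch (the payload attribute and its size are assumptions for illustration):

import cloudpickle  # the serializer PySpark uses for closures

class C0(object):
    def __init__(self):
        self.payload = b"x" * (10 ** 6)  # hypothetical 1 MB attribute

    def func0(self, arg):
        return arg

    def make_mapper(self):
        return lambda x: self.func0(x)  # the lambda closes over self

plain = cloudpickle.dumps(lambda x: x)
bound = cloudpickle.dumps(C0().make_mapper())
print(len(plain), len(bound))  # the second is ~1 MB: the whole instance rode along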
In order to avoid it, do something like:
class C0(object):

    def func0(self, arg):  # added self
        ...

    def func1(self, rdd):  # added self
        func = self.func0
        result = rdd.map(lambda x: func(x))
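For completeness, here is a runnable sketch of that pattern on a local SparkContext. The concrete func0 body, the return statement, and the local-mode setup are assumptions added so the example produces output:

from pyspark import SparkContext

class C0(object):
    def func0(self, arg):
        return arg * 2  # hypothetical body standing in for the ... above

    def func1(self, rdd):
        func = self.func0                  # bind the method once, locally
        return rdd.map(lambda x: func(x))  # the lambda never names self

sc = SparkContext("local[2]", "closure-demo")
print(C0().func1(sc.parallelize(range(5))).collect())  # [0, 2, 4, 6, 8]
sc.stop()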
Moral of the story: avoid the self keyword anywhere in a map call. Spark can be smart about serializing a single function if it can compute that function in a local closure, but any reference to self forces Spark to serialize your entire object.
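The linked programming guide recommends the same trick for attributes: copy the field into a local variable so the closure captures only that value rather than the whole object. A sketch with hypothetical names (C1, factor, big_state):

class C1(object):
    def __init__(self):
        self.factor = 3                   # the one attribute the map needs
        self.big_state = [0] * (10 ** 6)  # would otherwise ship to executors

    def scale(self, rdd):
        factor = self.factor                  # local copy of the field
        return rdd.map(lambda x: x * factor)  # closure captures an int only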