Spark Object (singleton) serialization on executors


Question

I am not sure that what I want to achieve is possible. What I do know is that I am accessing a singleton object from an executor to ensure its constructor is called only once on each executor. This pattern is already proven and works as expected for similar use cases in my code base.

However, what I would like to know is whether I can ship the object after it has been initialized on the driver. In this scenario, when accessing ExecutorAccessedObject.y, ideally it would not call println but just return the value. This is a highly simplified version; in reality, I would like to make a call to some external system on the driver, so that when the value is accessed on the executor, the external system is not called again. I am OK with @transient lazy val x being reinitialized once on each executor, as it will hold a connection pool, which cannot be serialized.

object ExecutorAccessedObject extends Serializable {
  @transient lazy val x: Int = {
    println("OK to initialize this on the executor, e.g. a database connection pool")
    1
  }

  val y: Int = {
    // call some external system to return a value.
    // I do not want to call the external system from the executor
    println(
      """
        |Ideally, this would not be printed on the executor.
        |Return value 1 without reinitializing
      """.stripMargin)
    1
  }
  println("The constructor runs once on each executor")
}


someRdd.mapPartitions { part =>
  ExecutorAccessedObject   // first reference runs the object's constructor
  ExecutorAccessedObject.x // first access should evaluate the lazy val
  ExecutorAccessedObject.y // ideally, never re-evaluated; just returns 1
  part
}

I attempted to solve this with broadcast variables as well, but I am unsure how to access the broadcast variable within the singleton object.

Recommended answer

"What I would like to know is if I can ship the object after it has been initialized on the driver."

You cannot. Objects, as singletons, are never shipped to executors. They are initialized locally, whenever the object is accessed for the first time.
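The local, first-access initialization can be seen in plain Scala without Spark. The following is a minimal sketch; the names InitLog and DemoSingleton are hypothetical and exist only for this illustration. Each executor is a separate JVM, so each one would run the body of such an object itself the first time it is referenced there.

```scala
// Hypothetical helper that counts how often the singleton body has run in this JVM.
object InitLog {
  var count: Int = 0
}

// The body of a Scala object runs on first access, once per JVM.
object DemoSingleton {
  InitLog.count += 1 // side effect standing in for the external call
  val y: Int = 1
}
```

Before DemoSingleton is referenced, InitLog.count is 0. After any number of accesses to DemoSingleton.y in the same JVM, it is 1.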

If the result of the call is serializable, just pass it along, either as an argument to ExecutorAccessedObject (implicitly or explicitly) or by making ExecutorAccessedObject mutable (and adding the required synchronization).
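As a rough sketch of the mutable variant, the object below is a reworked, hypothetical version of ExecutorAccessedObject. The init method, the yValue field, and callExternalSystem() are assumptions made for illustration and do not come from the original code. The idea is that the driver computes the value once, the closure captures it as a plain serializable Int, and each executor stores it in its local copy of the singleton.

```scala
// Hypothetical sketch: a mutable singleton that receives the driver-computed value.
object ExecutorAccessedObject {
  // Fine to build once per executor JVM, e.g. a connection pool.
  lazy val x: Int = {
    println("Initializing x locally")
    1
  }

  // Holds the value computed on the driver; empty until init is called in this JVM.
  @volatile private var yValue: Option[Int] = None

  // Stores the value on the first call in this JVM; later calls are ignored.
  def init(value: Int): Unit = synchronized {
    if (yValue.isEmpty) yValue = Some(value)
  }

  // Returns the stored value, or fails loudly if init was never called here.
  def y: Int = yValue.getOrElse(
    throw new IllegalStateException("ExecutorAccessedObject.init was not called in this JVM"))
}

// Driver side (assumes an existing RDD named someRdd and a hypothetical callExternalSystem()):
//   val yFromDriver: Int = callExternalSystem()  // runs once, on the driver
//   someRdd.mapPartitions { part =>
//     ExecutorAccessedObject.init(yFromDriver)   // the Int travels with the closure
//     ExecutorAccessedObject.x                   // built locally on first access
//     ExecutorAccessedObject.y                   // no external call on the executor
//     part
//   }
```

The explicit-argument variant is simpler still: capture yFromDriver in the closure and hand it to whatever method needs it, without storing it in the object at all.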
