如何在火花中对每个执行器执行一次操作 [英] How to perform one operation on each executor once in spark

查看:83
本文介绍了如何在火花中对每个执行器执行一次操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在S3中存储了一个weka模型,大小约为400MB. 现在,我有一些记录,我要在该记录上运行模型并执行预测.

I have a weka model stored in S3 which is of size around 400MB. Now, I have some set of record on which I want to run the model and perform prediction.

为了进行预测,我尝试过的是

For performing prediction, What I have tried is,

  1. 下载模型并将其作为静态对象加载到驱动程序上,并将其广播给所有执行者.对预测RDD执行映射操作. ---->无法正常工作,就像在Weka中执行预测一样,需要修改模型对象,并且广播需要只读副本.

  1. Download and load the model on driver as a static object , broadcast it to all executors. Perform a map operation on prediction RDD. ----> Not working, as in Weka for performing prediction, model object needs to be modified and broadcast require a read-only copy.

下载模型并将其作为静态对象加载到驱动程序上,并在每次映射操作中将其发送给执行程序. ----->工作(效率不高,因为在每个地图操作中,我正在传递400MB的对象)

Download and load the model on driver as a static object and send it to executor in each map operation. -----> Working (Not efficient, as in each map operation, i am passing 400MB object)

在驱动程序上下载模型,并将其加载到每个执行器上,并将其缓存在该模型中. (不知道该怎么做)

Download the model on driver and load it on each executor and cache it there. (Don't know how to do that)

有人知道如何在每个执行器上加载一次模型并对其进行缓存,这样对于其他记录,我就不会再次加载它?

Does someone have any idea how can I load the model on each executor once and cache it so that for other records I don't load it again?

推荐答案

您有两个选择:

    object WekaModel {
        lazy val data = {
            // initialize data here. This will only happen once per JVM process
        }
    }       

然后,您可以在map函数中使用lazy val. lazy val确保每个辅助JVM初始化自己的数据实例. data不会执行序列化或广播.

Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes their own instance of the data. No serialization or broadcasts will be performed for data.

    elementsRDD.map { element =>
        // use WekaModel.data here
    }

优势

  • 效率更高,因为它允许您为每个JVM实例初始化一次数据.例如,在需要初始化数据库连接池时,此方法是一个不错的选择.

缺点

  • 对初始化的控制较少.例如,如果需要运行时参数,则初始化对象会比较棘手.
  • 如果需要,您无法真正释放或释放对象.通常,这是可以接受的,因为当进程退出时,操作系统会释放资源.

这使您可以初始化整个分区所需的任何内容.

This allows you to initialize whatever you need for the entire partition.

    elementsRDD.mapPartition { elements =>
        val model = new WekaModel()

        elements.map { element =>
            // use model and element. there is a single instance of model per partition.
        }
    }

优势:

  • 在对象的初始化和取消初始化方面提供更大的灵活性.

缺点

  • 每个分区将创建并初始化对象的新实例.根据每个JVM实例有多少个分区,这可能是问题,也可能不是问题.

这篇关于如何在火花中对每个执行器执行一次操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆