How to update a ML model during a spark streaming job without restarting the application?

Question
I've got a Spark Streaming job whose goal is to:

- read a batch of messages
- predict a variable Y from those messages, using a pre-trained ML pipeline
The problem is, I'd like to be able to update the model used by the executors without restarting the application.
Simply put, here's what it looks like:
model = # model initialization

def preprocess(keyValueList):
    # do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = # create df from rdd
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)
stream.mapPartitions(preprocess).foreachRDD(predict)
In this case, the model is simply used, never updated.
I've thought about several possibilities, but I have now crossed them all out:

- broadcasting the model each time it changes (broadcast variables are read-only, so they cannot be updated)
- reading the model from HDFS on the executors (this requires a SparkContext, so it's not possible)
Any idea?

Thanks a lot!
Answer

I've solved this issue before in two different ways:
- a TTL on the model
- re-reading the model every batch
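A minimal sketch of the TTL approach, kept independent of Spark so the caching logic is clear. The `loader` callable and the `ttl_seconds` value are hypothetical stand-ins; in the real job the loader would deserialize the pre-trained pipeline from HDFS or a model store:

```python
import time

class TTLModel:
    """Caches a model and transparently reloads it once the TTL expires."""

    def __init__(self, loader, ttl_seconds):
        self.loader = loader          # callable that returns a fresh model
        self.ttl = ttl_seconds
        self.model = None
        self.loaded_at = float("-inf")  # force a load on first access

    def get(self, now=None):
        now = time.time() if now is None else now
        if now - self.loaded_at >= self.ttl:
            self.model = self.loader()
            self.loaded_at = now
        return self.model
```

Inside `predict`, `model.transform(df)` would become `ttl_model.get().transform(df)`, so a retrained model picked up by the loader replaces the old one at most `ttl_seconds` after it is published.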
Both of these solutions assume an additional job that regularly trains on the data you've accumulated (e.g. once a day).