Serve real-time predictions with trained Spark ML model

Problem description

We are currently testing a prediction engine based on Spark's implementation of LDA in Python:

  • https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda
  • https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA

(we are using the pyspark.ml package, not pyspark.mllib)

We were able to successfully train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. a Flask application).

What would be the best approach to achieve this?

Our main pain point is that it seems we need to bring back the whole Spark environment in order to load the trained model and run the transform. So far we've tried running Spark in local mode for each received request, but this approach gave us:

  1. Poor performance (time to start the SparkSession, load the models, run the transform...)
  2. Poor scalability (unable to handle concurrent requests)

The whole approach seems quite heavy. Would there be a simpler alternative, or even one that does not involve Spark at all?

Below is simplified code for the training and prediction steps.

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

def train(input_dataset):
    conf = pyspark.SparkConf().setAppName("lda-train")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Generate count vectors
    count_vectorizer = CountVectorizer(...)
    vectorizer_model = count_vectorizer.fit(input_dataset)
    vectorized_dataset = vectorizer_model.transform(input_dataset)

    # Instantiate LDA model
    lda = LDA(k=100, maxIter=100, optimizer="em", ...)

    # Train LDA model
    lda_model = lda.fit(vectorized_dataset)

    # Save models to external storage
    vectorizer_model.write().overwrite().save("gs://...")
    lda_model.write().overwrite().save("gs://...")

Prediction code

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizerModel
from pyspark.ml.clustering import DistributedLDAModel

def predict(input_query):
    conf = pyspark.SparkConf().setAppName("lda-predict").setMaster("local")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Load models from external storage
    vectorizer_model = CountVectorizerModel.load("gs://...")
    lda_model = DistributedLDAModel.load("gs://...")

    # Run prediction on the input data using the loaded models
    vectorized_query = vectorizer_model.transform(input_query)
    transformed_query = lda_model.transform(vectorized_query)

    ...

    spark.stop()

    return transformed_query
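
For context, this is roughly what the per-request Flask wrapper looks like. It is only a minimal sketch of the pattern described above: the /predict route, the JSON payload shape, and the "tokens" column name are illustrative assumptions, not our exact code.

from flask import Flask, request, jsonify
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizerModel
from pyspark.ml.clustering import DistributedLDAModel

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # A new SparkSession and model load for every request -- the source of the
    # startup latency and concurrency issues listed above.
    tokens = request.get_json()["tokens"]  # hypothetical payload shape

    conf = pyspark.SparkConf().setAppName("lda-predict").setMaster("local")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    vectorizer_model = CountVectorizerModel.load("gs://...")
    lda_model = DistributedLDAModel.load("gs://...")

    # The column name must match the CountVectorizer inputCol used at training time
    query_df = spark.createDataFrame([(tokens,)], ["tokens"])
    vectorized_query = vectorizer_model.transform(query_df)
    transformed_query = lda_model.transform(vectorized_query)

    # "topicDistribution" is the default output column added by LDAModel.transform
    topic_distribution = transformed_query.select("topicDistribution").first()[0]

    spark.stop()
    return jsonify({"topicDistribution": topic_distribution.toArray().tolist()})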

Recommended answer

If you already have a trained machine learning model in Spark, you can use Hydrosphere Mist to serve the model (for testing or prediction) through a REST API without creating a Spark context. This will save you from recreating the Spark environment and let you rely only on web services for prediction.

References:

  • https://github.com/Hydrospheredata/mist
  • https://github.com/Hydrospheredata/spark-ml-serving
  • https://github.com/Hydrospheredata/hydro-serving
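
Whichever serving layer is used, the application side then reduces to a plain HTTP call instead of bootstrapping a SparkSession per request. A hypothetical sketch (the URL and payload shape are placeholders for illustration, not Mist's actual API; see the references above for the real request format):

import requests

# Placeholder endpoint and payload -- consult the Hydrosphere docs for the actual contract
response = requests.post(
    "http://<serving-host>/predict",
    json={"tokens": ["spark", "lda", "topic", "model"]},
)
print(response.json())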
