Cannot load pipeline model from pyspark


Problem description

Hello, I am trying to load a saved pipeline with `PipelineModel` in pyspark.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    selectedDf = reviews\
        .select("reviewerID", "asin", "overall")

    # Make pipeline to build recommendation
    reviewerIndexer = StringIndexer(
        inputCol="reviewerID",
        outputCol="intReviewer"
        )
    productIndexer = StringIndexer(
        inputCol="asin",
        outputCol="intProduct"
        )
    pipeline = Pipeline(stages=[reviewerIndexer, productIndexer])
    pipelineModel = pipeline.fit(selectedDf)
    transformedFeatures = pipelineModel.transform(selectedDf)
    pipeline_model_name = './' + model_name + 'pipeline'
    pipelineModel.save(pipeline_model_name)

This code successfully saves the model to the filesystem, but the problem is that I cannot load this pipeline to use it on other data. When I try to load the model with the following code, I get this error:

        pipelineModel = PipelineModel.load(pipeline_model_name)

Traceback (most recent call last):
  File "/app/spark/load_recommendation_model.py", line 12, in <module>
    sa.load_model(pipeline_model_name, recommendation_model_name, user_id)
  File "/app/spark/sparkapp.py", line 142, in load_model
    pipelineModel = PipelineModel.load(pipeline_model_name)
  File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 311, in load
  File "/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 240, in load
  File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 497, in loadMetadata
  File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1379, in first
ValueError: RDD is empty

What is the problem? How can I solve this?

Answer

I had the same issue. The problem was that I was running Spark on a cluster of nodes, but I was not using a shared file system to save my models. Saving the trained model therefore wrote the model's data to the local disks of the Spark workers that held it in memory. When I later wanted to load the model, I used the same path I had used when saving. In that situation the Spark master looks for the model at the specified path on ITS OWN local filesystem, where the data is incomplete. It therefore reports that the RDD (the data) is empty (if you look at the saved model's directory you will see only `_SUCCESS` files, but to load the model the `part-00000` data files are also required).

Using a shared file system such as HDFS fixes the problem.
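As a quick sanity check, you can verify that the model path uses a cluster-visible URI scheme rather than a worker-local path before saving or loading. This is only an illustrative sketch: `is_shared_path`, the `namenode` address, and the list of schemes are assumptions for the example, not part of Spark's API.

```python
from urllib.parse import urlparse

# Hypothetical helper: on a multi-node cluster, save/load paths should use a
# scheme every node can reach (HDFS, S3, ...) instead of a local relative path.
def is_shared_path(path):
    scheme = urlparse(path).scheme
    return scheme in ("hdfs", "s3a", "s3", "gs", "wasbs")

# A shared HDFS path (the namenode address is a made-up example):
is_shared_path("hdfs://namenode:9000/models/recommendation_pipeline")  # True

# A local relative path like the one in the question:
is_shared_path("./recommendation_pipeline")  # False
```

With a shared path, `pipelineModel.save(...)` and `PipelineModel.load(...)` both resolve to the same location from every node, so the driver sees the complete set of part files.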
