What is the right way to save\load models in Spark\PySpark

Problem description

I'm working with Spark 1.3.0 using PySpark and MLlib, and I need to save and load my models. I use code like this (taken from the official documentation):

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data (sc is an existing SparkContext)
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using ALS
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

# Predict on the training data, then save the model
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
predictions.collect() # shows me some predictions
model.save(sc, "model0")

# Trying to load saved model and work with it
model0 = MatrixFactorizationModel.load(sc, "model0")
predictions0 = model0.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))

When I try to use model0 I get a long traceback, which ends with this:

Py4JError: An error occurred while calling o70.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

So my question is - am I doing something wrong? As far as I can tell from debugging, the models are stored (both locally and on HDFS) and they consist of several files containing data. I have a feeling the models are saved correctly but aren't being loaded correctly. I also googled around but found nothing related.

It looks like this save\load feature was added recently in Spark 1.3.0, which raises another question - what was the recommended way to save\load models before the 1.3.0 release? I haven't found any nice way to do this, at least for Python. I also tried Pickle, but ran into the same issues described here: Save Apache Spark mllib model in python
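
For illustration, one manual approach for older releases is to persist the model's latent factor RDDs and recompute predictions from them. This is only a sketch, assuming userFeatures()/productFeatures() are exposed by this PySpark version and that the factor matrices fit in driver memory; the paths are hypothetical.

import numpy as np

# Persist the latent factor matrices ourselves (paths are illustrative).
model.userFeatures().saveAsPickleFile("model0_user_features")
model.productFeatures().saveAsPickleFile("model0_product_features")

# Later (e.g. in another job): load the factors back.
# Collecting into dicts assumes the factors fit in driver memory.
user_features = dict(sc.pickleFile("model0_user_features").collect())
product_features = dict(sc.pickleFile("model0_product_features").collect())

def predict(user, product):
    # An ALS rating estimate is the dot product of the two factor vectors.
    return float(np.dot(user_features[user], product_features[product]))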

Recommended answer

As of this pull request, merged on Mar 28, 2015 (a day after your question was last edited), this issue has been resolved.

You just need to clone/fetch the latest version from GitHub (git clone git://github.com/apache/spark.git -b branch-1.3), then build it (following the instructions in spark/README.md) with $ mvn -DskipTests clean package.
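
With that build in place, the save/load code from your question should work unchanged; a minimal round-trip check (a sketch reusing sc, model and testdata from the question, with an illustrative path) could look like this:

from pyspark.mllib.recommendation import MatrixFactorizationModel

# Save and immediately reload the model, then predict with the reloaded copy.
model.save(sc, "model0")
model0 = MatrixFactorizationModel.load(sc, "model0")
predictions0 = model0.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
print(predictions0.take(3))  # should show a few predictions instead of raising Py4JError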

Note: I ran into trouble building Spark because Maven was being wonky. I resolved that issue by using $ update-alternatives --config mvn and selecting the 'path' that had Priority: 150, whatever that means. Explanation here.
