Save Apache Spark mllib model in python
Question
I am trying to save a fitted model to a file in Spark. I have a Spark cluster which trains a RandomForest model. I would like to save the fitted model and reuse it on another machine. I read some posts on the web which recommend doing Java serialization. I am doing the equivalent in Python, but it does not work. What is the trick?
import pickle

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=nb_tree, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=depth)
output = open('model.ml', 'wb')
pickle.dump(model, output)
I get this error:
TypeError: can't pickle lock objects
I am using Apache Spark 1.2.0.
Answer
If you look at the source code, you'll see that RandomForestModel inherits from TreeEnsembleModel, which in turn inherits from the JavaSaveable class that implements the save() method, so you can save your model as in the example below:
model.save([spark_context], [file_path])
So it will save the model into file_path using the spark_context. You cannot (at least for now) use Python's native pickle to do that. If you really want to do that, you'll need to implement the __getstate__ and __setstate__ methods manually. See the pickle documentation for more information.
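To illustrate the __getstate__/__setstate__ mechanism behind the "can't pickle lock objects" error, here is a minimal sketch in plain Python. The ModelWrapper class is hypothetical (it only stands in for an object that, like the Spark model wrapper, holds an unpicklable lock); this shows the pickle protocol itself, not a full workaround for Spark models, since an MLlib model's actual state lives in the JVM:

```python
import pickle
import threading

class ModelWrapper:
    """Hypothetical object holding a lock, like the Spark model wrapper does."""
    def __init__(self, weights):
        self.weights = weights
        self._lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Return a copy of the instance dict with the unpicklable lock removed.
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore the picklable attributes and recreate a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()

m = ModelWrapper([0.1, 0.2])
data = pickle.dumps(m)          # would raise TypeError without __getstate__
restored = pickle.loads(data)
print(restored.weights)         # [0.1, 0.2]
```

Without __getstate__, pickle.dumps(m) raises the same TypeError shown in the question, because it tries to serialize the lock along with everything else in the instance dict.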