Save ML model for future usage


Problem Description

I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid RDDs and start using DataFrames, because RDDs are slower than DataFrames under pyspark (see pic 1).

The other reason why I am using DataFrames is that the ml library has a very useful class for tuning models, CrossValidator. This class returns a model after fitting it; obviously the method has to test several scenarios, and after that it returns the fitted model (with the best combination of parameters).
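For context, a minimal sketch of how CrossValidator is typically used in pyspark.ml (the estimator, parameter grid, and column names below are hypothetical, not taken from the question):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Hypothetical estimator and parameter grid; train_df is assumed to be a
    # DataFrame with "features" and "label" columns.
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)

    # fit() evaluates every parameter combination and returns a CrossValidatorModel;
    # its bestModel attribute is the fitted model with the best combination.
    cv_model = cv.fit(train_df)
    best_model = cv_model.bestModel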

The cluster I use isn't so large and the data is pretty big, so some fits take hours. I want to save these models to reuse them later, but I haven't figured out how; is there something I am missing?

Note:


  • The mllib model classes have a save method (e.g. NaiveBayes), but mllib does not have a CrossValidator and uses RDDs, so I am deliberately avoiding it.
  • The current version of Spark is 1.5.1.

Recommended Answer

Spark >= 1.6

Since Spark 1.6 it has been possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LinearRegressionModel has it, so you can use it to save your model to the desired path.
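A minimal sketch of that persistence from PySpark (paths and column names are hypothetical; the Python-side save/load wrappers arrived later than the Scala MLWritable support, so this assumes a Spark 2.x-style Python API):

    from pyspark.ml.regression import LinearRegression, LinearRegressionModel

    # train_df is a previously prepared DataFrame with "features" and "label" columns.
    lr = LinearRegression(featuresCol="features", labelCol="label")
    model = lr.fit(train_df)

    # Persist the fitted model to a (hypothetical) path and reload it later.
    model.write().overwrite().save("/models/linear_regression")
    reloaded = LinearRegressionModel.load("/models/linear_regression")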

I believe you're making incorrect assumptions here.

Some operations on DataFrames can be optimized, and that translates into improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
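As a rough illustration of this point (the file path and column names are made up): the DataFrame operations below go through the optimizer and use the columnar in-memory cache, while the RDD version runs plain Python functions on every row.

    # sqlContext is assumed to be an existing SQLContext (as in the 1.x pyspark shell);
    # the input is a hypothetical dataset with "label" and "amount" columns.
    df = sqlContext.read.parquet("/data/events.parquet")

    # DataFrame version: select/filter are optimized, cache() uses the
    # in-memory columnar format.
    subset = df.select("label", "amount").filter(df.amount > 0).cache()

    # Roughly equivalent RDD version: plain Python lambdas on each Row.
    rdd_subset = (df.rdd
                    .map(lambda row: (row.label, row.amount))
                    .filter(lambda kv: kv[1] > 0)
                    .cache())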

ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
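For illustration, a small hypothetical pipeline with an evaluator (the stages and column names are invented, not part of the original answer):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Hypothetical raw feature columns "x1", "x2" and a binary "label" column.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # The Pipeline chains the feature transformer and the estimator.
    pipeline = Pipeline(stages=[assembler, lr])
    pipeline_model = pipeline.fit(train_df)

    # An evaluator scores the predictions produced by the fitted pipeline.
    predictions = pipeline_model.transform(test_df)
    auc = BinaryClassificationEvaluator().evaluate(predictions)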

So far so good, but there are a few problems:


  • as far as I can tell, simple operations on a DataFrame like select or withColumn show performance similar to their RDD equivalents like map,
  • in some cases the growing number of columns in a typical pipeline can actually degrade performance compared to well-tuned low-level transformations. You can of course add column-dropping transformers along the way to correct for that,
  • many ML algorithms, including ml.classification.NaiveBayes, are simply wrappers around their mllib counterparts,
  • PySpark ML/MLlib algorithms delegate the actual processing to their Scala counterparts,
  • last but not least, RDDs are still out there, even if well hidden behind the DataFrame API.

I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do to combine both is to create a custom multi-step pipeline (sketched below):


  • use ML to load, clean and transform the data,
  • extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
  • add custom cross-validation / evaluation,
  • save the MLlib model using a method of your choice (Spark model or PMML).
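A rough sketch of what such a hybrid pipeline could look like (column names, paths, and the choice of NaiveBayes are only illustrative; sc is the SparkContext, and this assumes the 1.x-era API where DataFrame feature columns hold mllib vectors):

    from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
    from pyspark.mllib.regression import LabeledPoint

    # 1. Load, clean and transform the data with ML / DataFrames (omitted here);
    #    prepared_df is assumed to have "label" and "features" columns.
    # 2. Extract the required data and pass it to an mllib algorithm.
    labeled = prepared_df.select("label", "features").rdd.map(
        lambda row: LabeledPoint(row.label, row.features))

    # 3. A custom cross-validation / evaluation loop would wrap this training step.
    nb_model = NaiveBayes.train(labeled, lambda_=1.0)

    # 4. mllib models have a save() method, so the fitted model can be reloaded later.
    nb_model.save(sc, "/models/naive_bayes")
    restored = NaiveBayesModel.load(sc, "/models/naive_bayes")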

It is not an optimal solution, but it is the best one I can think of given the current API.
