Save ML model for future usage


Problem description


I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark (see pic 1).

The other reason I am using DataFrames is that the ml library has a very useful class for tuning models, CrossValidator. This class returns a model after fitting it; obviously it has to test several scenarios, and after that it returns a fitted model (with the best combination of parameters).

The cluster I use isn't very large and the data is pretty big, so some fits take hours. I want to save these models to reuse them later, but I haven't figured out how. Is there something I am overlooking?

Notes:

  • The mllib model classes have a save method (e.g. NaiveBayes), but mllib does not have CrossValidator and uses RDDs, so I am deliberately avoiding it.
  • The current version is Spark 1.5.1.

Solution

Spark 2.0.0+

At first glance all Transformers and Estimators implement MLWritable with the following interface:

def write: MLWriter
def save(path: String): Unit 

and MLReadable with the following interface:

def read: MLReader[T]
def load(path: String): T

This means that you can use the save method to write a model to disk, for example

import org.apache.spark.ml.PipelineModel

val model: PipelineModel = pipeline.fit(df)  // assuming an existing Pipeline `pipeline` and training DataFrame `df`
model.save("/path/to/model")

and read it later:

val reloadedModel: PipelineModel = PipelineModel.load("/path/to/model")

Equivalent methods are also implemented in PySpark with MLWritable / JavaMLWritable and MLReadable / JavaMLReadable respectively:

from pyspark.ml import Pipeline, PipelineModel

model = Pipeline(...).fit(df)
model.save("/path/to/model")

reloaded_model = PipelineModel.load("/path/to/model")
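
Since the question is specifically about models produced by CrossValidator, here is a minimal PySpark sketch (Spark 2.0+) of persisting the winning model. The DataFrame train_df, the parameter grid, and the paths below are illustrative assumptions, not part of the original answer; the best model found by cross-validation is saved like any other fitted model:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()  # expects "features" and "label" columns by default
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cv_model = cv.fit(train_df)                     # train_df: an assumed training DataFrame
cv_model.bestModel.save("/path/to/best_model")  # persist only the winning model

The saved model can later be reloaded with LogisticRegressionModel.load("/path/to/best_model").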

SparkR provides write.ml / read.ml functions, but as of today, these are not compatible with other supported languages - SPARK-15572.

Note that the loader class has to match the class of the stored PipelineStage. For example if you saved LogisticRegressionModel you should use LogisticRegressionModel.load not LogisticRegression.load.
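
A short sketch of this rule in PySpark (train_df and the path are placeholder assumptions):

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

model = LogisticRegression().fit(train_df)
model.save("/path/to/lr_model")

# correct: the loader class matches the class of the stored model
reloaded = LogisticRegressionModel.load("/path/to/lr_model")

# incorrect: the estimator class does not match the stored PipelineStage
# LogisticRegression.load("/path/to/lr_model")  # fails at load time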

If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching versions.

In addition to the Spark-specific methods, there is a growing number of libraries designed to save and load Spark ML models using Spark-independent methods. See for example How to serve a Spark MLlib model?.

Spark >= 1.6

Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LinearRegressionModel has it, and it is therefore possible to save the model to the desired path.

Spark < 1.6

I believe you're making incorrect assumptions here.

Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.

ML Pipelines are extremely useful, and tools like a cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal, and relatively well-tested solution.

So far so good, but there are a few problems:

  • as far as I can tell, simple operations on DataFrames like select or withColumn show performance similar to their RDD equivalents like map,
  • in some cases, growing the number of columns in a typical pipeline can actually degrade performance compared to well-tuned low-level transformations. You can of course add drop-column transformers along the way to correct for that,
  • many ML algorithms, including ml.classification.NaiveBayes, are simply wrappers around their mllib counterparts,
  • PySpark ML/MLlib algorithms delegate actual processing to their Scala counterparts,
  • last but not least, RDDs are still out there, even if well hidden behind the DataFrame API

I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine the two to create a custom multi-step pipeline:

  • use ML to load, clean, and transform the data,
  • extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
  • add custom cross-validation / evaluation,
  • save the MLlib model using a method of your choice (Spark model or PMML)

It is not an optimal solution, but it is the best one I can think of given the current API. A minimal sketch of this hybrid approach is shown below.
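
As an illustration, here is a hedged PySpark sketch of such a hybrid pipeline for Spark 1.x, where spark.ml still uses mllib vectors, so no conversion is needed. The DataFrame df, the column names, and the path are assumptions for this example, and the rows are mapped to LabeledPoints by hand rather than via the internal extractLabeledPoints helper:

from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.regression import LabeledPoint

# 1. use ML to transform the data (df is assumed to have numeric
#    feature columns "f1" and "f2" and a numeric "label" column)
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
assembled = assembler.transform(df)

# 2. extract the required data and pass it to an MLlib algorithm
points = assembled.select("label", "features").rdd.map(
    lambda row: LabeledPoint(row.label, row.features))

nb_model = NaiveBayes.train(points)

# 3. save the MLlib model (mllib's save takes the SparkContext)
nb_model.save(sc, "/path/to/nb_model")

Loading it back later works with NaiveBayesModel.load(sc, "/path/to/nb_model").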
