Online (incremental) logistic regression in Spark


Question

In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g. no access to model coefficients or output probabilities).

In Spark ML (the DataFrame-based API) I can only find the class LogisticRegression, which has only the fit method for batch training. This doesn't allow for a pattern of saving the model, reloading it and training it incrementally.

Needless to say, some applications benefit greatly from incremental learning. Is there any solution available in Spark?

Recommended answer

In Spark ML, calling LogisticRegression.fit() returns a LogisticRegressionModel. You can then add the LogisticRegressionModel to a Pipeline and save/load the pipeline for incremental training.

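import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))
// `data` is assumed to be a DataFrame with "label" and "features" columns
val model = pipeline.fit(data)
model.write.overwrite().save("/tmp/saved_model")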
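To make the save/load round trip concrete, here is a minimal sketch that fits the pipeline on a tiny made-up dataset, reloads it with PipelineModel.load, and reads the coefficients and output probabilities from the restored LogisticRegressionModel. The toy data, the local master, the object name and the hyperparameter values are assumptions for illustration; only the save path comes from the snippet above.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ReloadPipelineExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reload-lr")
      .getOrCreate()

    // Made-up training rows using the default "label" and "features" column names.
    val data = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.3)),
      (1.0, Vectors.dense(1.9, 0.8))
    )).toDF("label", "features")

    // Illustrative hyperparameters, not tuned values.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(lr))

    // Fit once and persist the fitted pipeline.
    pipeline.fit(data).write.overwrite().save("/tmp/saved_model")

    // Later (possibly in another job): restore the fitted pipeline from disk.
    val reloaded = PipelineModel.load("/tmp/saved_model")

    // The last stage is the fitted logistic regression model.
    val lrModel = reloaded.stages.last.asInstanceOf[LogisticRegressionModel]
    println(s"coefficients = ${lrModel.coefficients}, intercept = ${lrModel.intercept}")

    // transform() appends "rawPrediction", "probability" and "prediction" columns.
    reloaded.transform(data).select("label", "probability", "prediction").show(false)

    spark.stop()
  }
}
```

Note that the reloaded object is a fitted PipelineModel, so it is ready for scoring. As far as I can tell, the public LogisticRegression API has no warm-start setter, so calling fit again on new data would retrain rather than continue from the restored coefficients.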

If you want to train the model with streaming data or apply it to streaming data, you can define a Structured Streaming DataFrame and pass it to the pipeline.

For example (excerpt from the Spark documentation):

import org.apache.spark.sql.types.StructType

// Read all the csv files written atomically in a directory
// (`spark` is an existing SparkSession)
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
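A rough sketch of the "apply it to streaming data" case follows: a previously saved PipelineModel is loaded and its transform is called on the streaming DataFrame, with the scored rows written to the console sink. The model path /tmp/saved_model_with_assembler and the object name are hypothetical. The sketch assumes that pipeline was fitted on batch data with the same name/age schema and includes a stage (for example a VectorAssembler) that builds the "features" column from "age".

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object ScoreStreamExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("score-stream")
      .getOrCreate()

    // Hypothetical path: a fitted pipeline that derives "features" from "age"
    // and ends with a LogisticRegressionModel.
    val model = PipelineModel.load("/tmp/saved_model_with_assembler")

    // Same streaming source as in the documentation excerpt.
    val userSchema = new StructType().add("name", "string").add("age", "integer")
    val csvDF = spark.readStream
      .option("sep", ";")
      .schema(userSchema)
      .csv("/path/to/directory")

    // Score each micro-batch with the already fitted model.
    val scored = model.transform(csvDF)
      .select("name", "age", "probability", "prediction")

    // Print the scored rows as new files arrive in the directory.
    val query = scored.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

This covers scoring only. The model is fitted beforehand on a batch DataFrame and the stream is passed to transform, not to fit.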
