Spark Structured Streaming and Spark ML Regression
Question
Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.
- How should I apply regression to a Structured Streaming source?
- (Slightly off topic) If I can't use the streaming API for regression, how can I commit offsets in batch mode? (Kafka sink)
Answer
As of today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and no ongoing work in this direction. Please follow SPARK-16424 to track future progress.
However, you can:
Train iterative, non-distributed models using the foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this:

- Fetch the latest model when ForeachWriter.open is called, and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
- Compute the loss for each record in ForeachWriter.process and update the accumulator.
- Push the losses to the external store when ForeachWriter.close is called.
- This leaves the external storage in charge of computing gradients and updating the model, with the implementation depending on the store.
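The open/process/close cycle described above can be sketched without Spark at all. The following self-contained Python sketch mimics the ForeachWriter contract for a one-feature linear regression against an in-memory store; both `ModelStore` and `RegressionWriter` are invented names for illustration, and in a real deployment the store would be an external system such as Redis or a database, not a local object.

```python
class ModelStore:
    """Stand-in for the external state store: holds the weights and
    applies pushed gradient sums (the update step lives in the store)."""
    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim
        self.lr = lr

    def fetch(self):
        # Return a snapshot of the latest model.
        return list(self.weights)

    def push(self, grad_sum, count):
        # Average the accumulated gradient and take one SGD step.
        if count:
            self.weights = [w - self.lr * g / count
                            for w, g in zip(self.weights, grad_sum)]


class RegressionWriter:
    """Mirrors the ForeachWriter.open/process/close protocol."""
    def __init__(self, store):
        self.store = store

    def open(self, partition_id, epoch_id):
        self.model = self.store.fetch()      # fetch latest model
        self.grad = [0.0] * len(self.model)  # local accumulator, per partition
        self.count = 0
        return True

    def process(self, row):
        features, label = row
        pred = sum(w * x for w, x in zip(self.model, features))
        err = pred - label                   # squared-loss gradient: err * x
        for i, x in enumerate(features):
            self.grad[i] += err * x
        self.count += 1

    def close(self, error):
        # Push the accumulated loss/gradient to the external store.
        self.store.push(self.grad, self.count)


# Simulate micro-batches arriving as partitions of a stream (y = 2x).
store = ModelStore(dim=1)
partitions = [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0)]]
for epoch in range(200):
    for pid, part in enumerate(partitions):
        writer = RegressionWriter(store)
        writer.open(pid, epoch)
        for row in part:
            writer.process(row)
        writer.close(None)
```

After enough micro-batches the stored weight converges toward 2.0. The key design point matches the bullets above: each writer instance only accumulates locally between open and close, and all cross-partition coordination happens in the store.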
Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau).