Real-time data standardization / normalization with Spark structured streaming


Problem Description

Standardizing / normalizing data is an essential, if not crucial, step when it comes to implementing machine learning algorithms. Doing so in a real-time manner using Spark structured streaming is a problem I've been trying to tackle for the past couple of weeks.

Using the StandardScaler estimator ((value(i) - mean) / standard deviation) on historical data proved to work well, and in my use case it is the best way to get reasonable clustering results, but I'm not sure how to fit a StandardScaler model on real-time data; structured streaming does not allow it. Any advice would be highly appreciated!
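For context, here is a minimal sketch of the batch (historical) fitting step, assuming PySpark; the column names ("x1", "x2", "x3"), the input path, and the vector column names are placeholders, not from the original question:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("batch-scaling").getOrCreate()

# Historical data; path and column names are placeholders.
historical = spark.read.parquet("/data/historical")

# Assemble the numeric columns into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features_raw")
assembled = assembler.transform(historical)

# (value(i) - mean) / standard deviation
scaler = StandardScaler(inputCol="features_raw", outputCol="features",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(assembled)        # fit() requires a static DataFrame
scaled = scaler_model.transform(assembled)  # input for clustering, e.g. KMeans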

In other words, how do you fit models in Spark structured streaming?

Recommended Answer

I got an answer for this. It's not possible at the moment to do real-time machine learning with Spark structured streaming, including normalization; however, for some algorithms, making real-time predictions is possible if an offline model was built/fitted first (see the sketch after the links below).

See:

JIRA - Add support for Structured Streaming to the ML Pipeline API

Google Doc - Machine Learning on Structured Streaming
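As a minimal sketch of the offline-fit / streaming-predict pattern mentioned above (assuming PySpark): the column names, paths, cluster count, and the toy rate source are placeholders, and whether every stage's transform() is supported on a streaming DataFrame depends on the Spark version.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("offline-fit-online-predict").getOrCreate()

# 1) Offline: fit the whole pipeline on historical data and persist the model.
historical = spark.read.parquet("/data/historical")  # placeholder path
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features_raw"),
    StandardScaler(inputCol="features_raw", outputCol="features",
                   withMean=True, withStd=True),
    KMeans(featuresCol="features", k=5),
])
model = pipeline.fit(historical)
model.write().overwrite().save("/models/clustering")  # placeholder path

# 2) Online: load the frozen model and apply transform() to the stream.
#    fit() is never called on the streaming DataFrame.
stream = (spark.readStream.format("rate").load()  # toy source; use Kafka/files in practice
          .selectExpr("CAST(value AS DOUBLE) AS x1",
                      "CAST(value AS DOUBLE) AS x2",
                      "CAST(value AS DOUBLE) AS x3"))

predictions = PipelineModel.load("/models/clustering").transform(stream)

query = (predictions.select("features", "prediction")
         .writeStream.format("console").outputMode("append").start())
query.awaitTermination()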

