Using Spark ML Pipelines just for Transformations

Question

I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
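
For illustration, here is a minimal sketch of what such an ETL-oriented custom Transformer can look like; the UnitPriceToTotal class and the column names are hypothetical examples, not part of any existing API:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical ETL step: derives a "total" column from two existing columns.
class UnitPriceToTotal(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("unitPriceToTotal"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("total", col("quantity") * col("unit_price"))

  // Declares the schema change up front so the Pipeline can validate stages
  // before any data is processed.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("total", DoubleType, nullable = true))

  override def copy(extra: ParamMap): UnitPriceToTotal = defaultCopy(extra)
}
```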

However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DF lineage tracking). The argument for this side is that Spark's ML pipelines are not intended just for ETL jobs, and should always be implemented with the goal of producing a column that can be fed to a Spark ML Evaluator. The argument against it is that it requires a lot of work to mirror already existing functionality.
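
As a point of comparison, a hand-rolled version of that alternative might look roughly like the following sketch; Step, runWithLineage, and the schema-snapshot log are all hypothetical names invented here, not an existing API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical: one named transformation in a hand-rolled pipeline.
final case class Step(name: String, run: DataFrame => DataFrame)

// Applies each step in order, recording a schema snapshot after each one
// as a crude form of lineage tracking.
def runWithLineage(input: DataFrame, steps: Seq[Step]): (DataFrame, Seq[String]) =
  steps.foldLeft((input, Seq(s"input: ${input.schema.simpleString}"))) {
    case ((df, log), step) =>
      val out = step.run(df)
      (out, log :+ s"${step.name}: ${out.schema.simpleString}")
  }
```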

Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?

Answer

To me, this seems like a great idea, especially if you can compose the generated Pipelines into new ones: a Pipeline can itself be made up of other Pipelines, since Pipeline extends PipelineStage further up the class hierarchy (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
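
A small sketch of that composition, reusing the hypothetical UnitPriceToTotal transformer from the question-side sketch above (df stands for whatever input DataFrame you have):

```scala
import org.apache.spark.ml.Pipeline

// Each Pipeline is itself a PipelineStage, so pipelines nest directly.
val pricing = new Pipeline().setStages(Array(new UnitPriceToTotal()))
val etl = new Pipeline().setStages(Array(pricing /*, more pipelines... */))

// With Transformer-only stages, fit() learns nothing; it just produces
// a PipelineModel that replays the stages in order.
val transformed = etl.fit(df).transform(df)
```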

But keep in mind that you will probably be doing the same thing under the hood, as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):

Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
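
To make the quoted description concrete, here is a minimal UnaryTransformer sketch; the Uppercase class is a hypothetical example:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical single-column transformer: upper-cases a string column.
class Uppercase(override val uid: String)
    extends UnaryTransformer[String, String, Uppercase] {
  def this() = this(Identifiable.randomUID("uppercase"))

  // The function the base class wraps in a Spark SQL udf.
  override protected def createTransformFunc: String => String = _.toUpperCase

  // The type of the generated output column.
  override protected def outputDataType: DataType = StringType
}
```

In other words, this is roughly equivalent to hand-writing df.withColumn("out", udf((s: String) => s.toUpperCase)(col("in"))), which is why the two approaches end up mirroring each other.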

If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.
