Using Spark ML Pipelines just for Transformations

Question

I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
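
For illustration, here is a minimal sketch of what such an ETL-oriented custom Transformer can look like; the UnitPriceToTotal class and the column names are hypothetical examples, not part of any existing API:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical ETL step: derives a "total" column from two existing columns.
class UnitPriceToTotal(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("unitPriceToTotal"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("total", col("quantity") * col("unit_price"))

  // Declares the schema change up front so the Pipeline can validate stages
  // before any data is processed.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("total", DoubleType, nullable = true))

  override def copy(extra: ParamMap): UnitPriceToTotal = defaultCopy(extra)
}
```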

However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DF lineage tracking). The argument for this side is that Spark's ML pipelines are not intended just for ETL jobs, and should always be implemented with the goal of producing a column that can be fed to a Spark ML Evaluator. The argument against it is that it requires a lot of work to mirror already existing functionality.
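
As a point of comparison, a hand-rolled version of that alternative might look roughly like the following sketch; Step, runWithLineage, and the schema-snapshot log are all hypothetical names invented here, not an existing API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical: one named transformation in a hand-rolled pipeline.
final case class Step(name: String, run: DataFrame => DataFrame)

// Applies each step in order, recording a schema snapshot after each one
// as a crude form of lineage tracking.
def runWithLineage(input: DataFrame, steps: Seq[Step]): (DataFrame, Seq[String]) =
  steps.foldLeft((input, Seq(s"input: ${input.schema.simpleString}"))) {
    case ((df, log), step) =>
      val out = step.run(df)
      (out, log :+ s"${step.name}: ${out.schema.simpleString}")
  }
```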

Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?

Answer

To me, this seems like a great idea, especially if you can compose the generated Pipelines into new ones: a Pipeline can itself be made up of other Pipelines, since Pipeline extends PipelineStage further up the class hierarchy (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
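
A small sketch of that composition, reusing the hypothetical UnitPriceToTotal transformer from the question-side sketch above (df stands for whatever input DataFrame you have):

```scala
import org.apache.spark.ml.Pipeline

// Each Pipeline is itself a PipelineStage, so pipelines nest directly.
val pricing = new Pipeline().setStages(Array(new UnitPriceToTotal()))
val etl = new Pipeline().setStages(Array(pricing /*, more pipelines... */))

// With Transformer-only stages, fit() learns nothing; it just produces
// a PipelineModel that replays the stages in order.
val transformed = etl.fit(df).transform(df)
```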

But keep in mind that you will probably be doing the same thing under the hood, as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):

Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
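
To make the quoted description concrete, here is a minimal UnaryTransformer sketch; the Uppercase class is a hypothetical example:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical single-column transformer: upper-cases a string column.
class Uppercase(override val uid: String)
    extends UnaryTransformer[String, String, Uppercase] {
  def this() = this(Identifiable.randomUID("uppercase"))

  // The function the base class wraps in a Spark SQL udf.
  override protected def createTransformFunc: String => String = _.toUpperCase

  // The type of the generated output column.
  override protected def outputDataType: DataType = StringType
}
```

In other words, this is roughly equivalent to hand-writing df.withColumn("out", udf((s: String) => s.toUpperCase)(col("in"))), which is why the two approaches end up mirroring each other.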

If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.
