Using Spark ML Pipelines just for Transformations

Question

I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
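
As a concrete illustration, a purely ETL-oriented stage might look like the following sketch. The class and column names (AddTotalColumn, price, tax, total) are invented for this example, not taken from the actual project:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical ETL-only Transformer: derives a `total` column from two
// existing numeric columns. No Estimator or Evaluator is involved.
class AddTotalColumn(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("addTotal"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("total", col("price") + col("tax"))

  // Declaring the output schema up front is what lets a Pipeline validate
  // the whole chain of stages before any data is processed.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("total", DoubleType, nullable = true))

  override def copy(extra: ParamMap): AddTotalColumn = defaultCopy(extra)
}
```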

However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DF lineage tracking). The argument for this side is that Spark's ML Pipelines are not intended for just ETL jobs, and should always be implemented with the goal of producing a column that can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work to mirror already existing functionality.
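
For contrast, a minimal sketch of the UDF route follows; the schema-snapshot history used here is an invented stand-in for whatever lineage scheme we would actually build, and the column names are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// One hand-rolled ETL step: apply a plain UDF, then record the resulting
// schema so a lineage history can be reconstructed later.
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)

def upperCaseName(df: DataFrame, history: Seq[String]): (DataFrame, Seq[String]) = {
  val out = df.withColumn("name_upper", toUpper(col("name")))
  (out, history :+ out.schema.treeString) // schema snapshot as crude lineage
}
```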

Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?

Answer

For me, it seems like a great idea, especially if you can compose the different Pipelines you generate into new ones, since a Pipeline can itself be made up of other Pipelines: Pipeline extends PipelineStage up the class tree (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
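
To make the composition point concrete, here is a minimal sketch that nests pipelines as stages. It reuses the hypothetical AddTotalColumn from the question sketch above; SQLTransformer is a built-in stage, but the SQL statement and toy input are illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy input with the columns the hypothetical AddTotalColumn expects.
val inputDF = Seq((10.0, 1.5), (20.0, 3.0)).toDF("price", "tax")

// Because Pipeline extends PipelineStage, whole pipelines nest as stages.
val cleaning = new Pipeline().setStages(Array(new AddTotalColumn()))
val enrichment = new Pipeline().setStages(Array(
  new SQLTransformer().setStatement("SELECT *, total * 0.1 AS discount FROM __THIS__")
))
val etl = new Pipeline().setStages(Array(cleaning, enrichment))

// With only Transformers as stages, fit learns nothing; it simply assembles
// a PipelineModel that replays the stages in order.
val result = etl.fit(inputDF).transform(inputDF)
result.show()
```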

But keep in mind that you will probably be doing the same thing under the hood as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):

Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
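
That is the UnaryTransformer pattern: you supply only createTransformFunc and outputDataType, and the base class handles the udf and withColumn plumbing. A minimal sketch with an invented class:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical single-column transformer: the base class wraps
// createTransformFunc in a udf and applies it via withColumn.
class LowerCaser(override val uid: String)
    extends UnaryTransformer[String, String, LowerCaser] {

  def this() = this(Identifiable.randomUID("lowerCaser"))

  override protected def createTransformFunc: String => String = _.toLowerCase

  override protected def outputDataType: DataType = StringType
}

// Usage: new LowerCaser().setInputCol("name").setOutputCol("name_lower")
```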

If you decide on another approach or find a better way, please comment. It's nice to share knowledge about Spark.
