Export Spark feature transformation pipeline to a file


Problem description

PMML, Mleap and PFA currently only support row-based transformations. None of them supports frame-based transformations such as aggregates, groupBy, or joins. What is the recommended way to export a Spark pipeline consisting of these operations?
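For illustration, a minimal sketch of the situation (input paths and column names such as userId, amount and age are hypothetical): the aggregate/join preparation happens as plain DataFrame operations outside the ML pipeline, so exporting the fitted PipelineModel only captures the row-wise stages.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("frame-based-prep").getOrCreate()

// Hypothetical inputs and column names.
val events = spark.read.parquet("/data/events")
val users  = spark.read.parquet("/data/users")

// Frame-based preparation: aggregate + join. This happens outside the ML pipeline,
// so exporting the fitted PipelineModel (PMML/Mleap/PFA) does not capture it.
val features = events
  .groupBy("userId")
  .agg(count("*").as("eventCount"), avg("amount").as("avgAmount"))
  .join(users, Seq("userId"))

// Row-based stage: this is the part the existing exporters can serialize.
val assembler = new VectorAssembler()
  .setInputCols(Array("eventCount", "avgAmount", "age"))
  .setOutputCol("features")

val model = new Pipeline().setStages(Array(assembler)).fit(features)
```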

Solution

I see two options with respect to Mleap:

1) Implement dataframe-based transformers and an Mleap equivalent of SQLTransformer. Conceptually this solution seems best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126 (a sketch of the Spark-side SQLTransformer follows this list).

2) Extend DefaultMleapFrame with the operations you want to perform, and then, in a modified MleapServing subproject, actually apply the required operations to the data handed to the REST server.
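For context on option 1, here is a minimal sketch of the Spark-side SQLTransformer that such an Mleap equivalent would have to replicate (column names like userId and amount are hypothetical):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer

// SQLTransformer can express a frame-based step (here a GROUP BY) as a pipeline stage;
// an Mleap equivalent would have to replay this statement at serving time.
val aggregate = new SQLTransformer().setStatement(
  "SELECT userId, COUNT(*) AS eventCount, AVG(amount) AS avgAmount FROM __THIS__ GROUP BY userId")

val pipeline = new Pipeline().setStages(Array(aggregate)) // plus the usual row-based stages
```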

I actually went with 2) and added implode, explode, and join as methods to DefaultMleapFrame, as well as a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
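The following is not MLeap's actual API; it is only a minimal, self-contained Scala sketch (the SimpleFrame type and method names are hypothetical) of the idea behind such extensions: a hash-indexed join and a groupBy/agg over plain rows.

```scala
// Hypothetical row model -- NOT the real DefaultMleapFrame API, just an illustration
// of how join and groupBy/agg can be added to a simple row-based frame.
case class SimpleFrame(columns: Seq[String], rows: Seq[Seq[Any]]) {

  private def idx(col: String): Int = columns.indexOf(col)

  // Hash-indexed inner join on a single key column: build the index over the right
  // side once, then probe it per left-side row (the idea behind a HashIndexedMleapFrame).
  def join(other: SimpleFrame, key: String): SimpleFrame = {
    val index: Map[Any, Seq[Seq[Any]]] = other.rows.groupBy(_(other.idx(key)))
    val otherCols = other.columns.filterNot(_ == key)
    val joined = for {
      row <- rows
      hit <- index.getOrElse(row(idx(key)), Nil)
    } yield row ++ otherCols.map(c => hit(other.idx(c)))
    SimpleFrame(columns ++ otherCols, joined)
  }

  // groupBy plus a single aggregate over one numeric column; supporting several
  // aggregates at once is a mechanical extension.
  def groupByAgg(key: String, valueCol: String, agg: Seq[Double] => Double): SimpleFrame = {
    val outRows = rows.groupBy(_(idx(key))).map { case (k, group) =>
      Seq(k, agg(group.map(r => r(idx(valueCol)).toString.toDouble)))
    }.toSeq
    SimpleFrame(Seq(key, valueCol + "_agg"), outRows)
  }
}
```

A production-grade HashIndexedMleapFrame along these lines would precompute the index when the frame is constructed rather than on every join call.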

