Export spark feature transformation pipeline to a file


Problem description

PMML, Mleap, and PFA currently only support row-based transformations. None of them support frame-based transformations such as aggregates, groupby, or join. What is the recommended way to export a Spark pipeline consisting of these operations?
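To make the question concrete, below is a minimal sketch (column names and sample data are made up for illustration) of a Spark pipeline that mixes a row-based stage with a frame-based one expressed through SQLTransformer. The GROUP BY stage needs the whole frame rather than a single row, which is exactly what the row-oriented export formats cannot represent:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object FrameBasedPipelineExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("frame-based-pipeline")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per transaction.
    val df = Seq(
      ("u1", "electronics", 120.0),
      ("u1", "groceries",    45.0),
      ("u2", "electronics",  80.0)
    ).toDF("user_id", "category", "amount")

    // Frame-based stage: an aggregate expressed through SQLTransformer.
    // This stage consumes the whole frame, not a single row.
    val aggregate = new SQLTransformer().setStatement(
      """SELECT user_id,
        |       COUNT(*)    AS num_purchases,
        |       SUM(amount) AS total_amount
        |FROM __THIS__
        |GROUP BY user_id""".stripMargin)

    // Row-based stage: per-row feature assembly is what PMML/Mleap/PFA do support.
    val assembler = new VectorAssembler()
      .setInputCols(Array("num_purchases", "total_amount"))
      .setOutputCol("features")

    val pipeline = new Pipeline().setStages(Array(aggregate, assembler))
    val model = pipeline.fit(df)
    model.transform(df).show()

    spark.stop()
  }
}
```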

Solution

I see two options with respect to Mleap:

1) Implement dataframe-based transformers and the SQLTransformer-Mleap equivalent. This solution seems conceptually the best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126

2) Extend the DefaultMleapFrame with the respective operations you want to perform, and then apply the required actions to the data handed to the REST server within a modified MleapServing subproject.

I actually went with 2) and added implode, explode, and join as methods on the DefaultMleapFrame, plus a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
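As a self-contained illustration of why such frame-level operations are relatively easy to add in Scala, here is a hypothetical sketch. The SimpleFrame and Row types below are stand-ins invented for this example, not MLeap's actual DefaultMleapFrame API; the sketch only shows the idea of attaching a group-by aggregate and a hash-indexed join to a frame type:

```scala
// Minimal in-memory frame: named columns plus rows of values (illustrative only).
final case class Row(values: Vector[Any])

final case class SimpleFrame(columns: Vector[String], rows: Vector[Row]) {
  private def idx(col: String): Int = columns.indexOf(col)

  // Frame-based operation: group by one column and aggregate another.
  def groupByAgg(keyCol: String, valueCol: String,
                 agg: Seq[Double] => Double): SimpleFrame = {
    val k = idx(keyCol)
    val v = idx(valueCol)
    val grouped = rows.groupBy(_.values(k)).toVector.map { case (key, rs) =>
      Row(Vector(key, agg(rs.map(_.values(v).toString.toDouble))))
    }
    SimpleFrame(Vector(keyCol, s"${valueCol}_agg"), grouped)
  }

  // Frame-based operation: hash join on a single key column, mirroring the idea
  // behind a hash-indexed frame for fast joins.
  def join(other: SimpleFrame, keyCol: String): SimpleFrame = {
    val lk = idx(keyCol)
    val rk = other.idx(keyCol)
    // Build a hash index over the right-hand frame once, then probe it per row.
    val index: Map[Any, Vector[Row]] = other.rows.groupBy(_.values(rk))
    val joinedRows = for {
      left  <- rows
      right <- index.getOrElse(left.values(lk), Vector.empty)
    } yield Row(left.values ++ right.values.patch(rk, Nil, 1))
    SimpleFrame(columns ++ other.columns.patch(rk, Nil, 1), joinedRows)
  }
}

object FrameExtensionSketch extends App {
  val purchases = SimpleFrame(
    Vector("user_id", "amount"),
    Vector(Row(Vector("u1", 120.0)), Row(Vector("u1", 45.0)), Row(Vector("u2", 80.0))))

  val users = SimpleFrame(
    Vector("user_id", "segment"),
    Vector(Row(Vector("u1", "gold")), Row(Vector("u2", "silver"))))

  val totals = purchases.groupByAgg("user_id", "amount", _.sum)
  println(totals.join(users, "user_id"))
}
```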

