Export spark feature transformation pipeline to a file
Problem description
PMML, Mleap and PFA currently only support row-based transformations. None of them supports frame-based transformations such as aggregates, groupby or join. What is the recommended way to export a Spark pipeline consisting of these operations?
I see 2 options wrt Mleap:

1) Implement dataframe-based transformers and the SQLTransformer-Mleap equivalent. This solution seems conceptually the best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126
2) Extend the DefaultMleapFrame with the respective operations you want to perform, and then actually apply the required actions to the data handed to the restserver within a modified MleapServing subproject.
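To make option 2 concrete, here is a minimal sketch of a hash-indexed equi-join over a simplified row-based frame. This is purely illustrative: `SimpleFrame` and its methods are hypothetical names, not the real `DefaultMleapFrame` API.

```scala
// Illustrative sketch only: a simplified frame with a hash-indexed inner
// equi-join, mirroring the idea of extending a row-based leap frame with
// frame-level operations. SimpleFrame is a hypothetical stand-in.
case class SimpleFrame(columns: Seq[String], rows: Seq[Seq[Any]]) {

  private def colIdx(name: String): Int = columns.indexOf(name)

  // Build a hash index on the right-hand key once, then probe it per left
  // row: roughly O(n + m) instead of a nested-loop scan.
  def join(right: SimpleFrame, leftKey: String, rightKey: String): SimpleFrame = {
    val li = colIdx(leftKey)
    val ri = right.colIdx(rightKey)
    val index: Map[Any, Seq[Seq[Any]]] = right.rows.groupBy(_(ri))
    val joined = for {
      lRow <- rows
      rRow <- index.getOrElse(lRow(li), Seq.empty)
    } yield lRow ++ rRow.patch(ri, Nil, 1) // drop the duplicated key column
    SimpleFrame(columns ++ right.columns.patch(ri, Nil, 1), joined)
  }
}
```

Pre-building the index is the same idea as the `HashIndexedMleapFrame` mentioned below: pay the hashing cost once so repeated probes during serving stay cheap.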
I actually went with 2) and added implode, explode and join as methods to the DefaultMleapFrame, and also a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
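As a sketch of why groupby/agg is easy in plain Scala: the standard collections already provide `groupBy`, so a frame-level aggregate reduces to grouping rows by a key column and folding a value column. Again, the names here are illustrative, not MLeap API.

```scala
// Hedged sketch: groupBy + agg over a simplified row representation.
// FrameOps and groupByAgg are hypothetical names for illustration.
object FrameOps {
  type Row = Seq[Any]

  // Group rows by the key column, then reduce each group's value column
  // with a caller-supplied aggregation function (e.g. sum, max, mean).
  def groupByAgg(rows: Seq[Row], keyIdx: Int, valIdx: Int)(
      agg: Seq[Double] => Double): Seq[Row] =
    rows
      .groupBy(_(keyIdx))
      .map { case (key, group) =>
        Seq(key, agg(group.map(_(valIdx).asInstanceOf[Double])))
      }
      .toSeq
}
```

For example, `FrameOps.groupByAgg(rows, 0, 1)(_.sum)` computes a per-key sum; swapping in `_.max` or an average gives other aggregates without touching the grouping logic.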