Export spark feature transformation pipeline to a file
PMML, Mleap, and PFA currently support only row-based transformations. None of them support frame-based transformations such as aggregates, groupby, or join. What is the recommended way to export a Spark pipeline consisting of these operations?
I see two options with respect to Mleap:

1) Implement dataframe-based transformers and the SQLTransformer-Mleap equivalent. This solution seems conceptually the best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126

2) Extend the DefaultMleapFrame with the respective operations you want to perform, and then actually apply the required actions to the data handed to the REST server within a modified MleapServing subproject.
I actually went with 2) and added implode, explode, and join as methods to the DefaultMleapFrame, and also a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.